Abstract


The  problem  of  database  query  optimization  is  to  select  an  efficient  way  to  process  a  query 
expressed  in  logical  terms  from  among  the  alternative  ways  it  can  be  carried  out  in  the  physical 
database.  This  thesis  presents  a  new  approach  to  this  problem,  called  semantic  query  optimization. 
The  goal  of  semantic  query  optimization  is  to  produce  a  semantically  equivalent  query  that  is  less 
expensive  to  process  than  the  original  query. 

Semantic  query  optimization  actually  transforms  the  original  query  into  a  new  one  by  means  of  a 
process  of  inference.  The  transformations  are  limited  to  those  that  yield  a  semantically  equivalent 
query,  one  that  is  guaranteed  to  produce  the  same  answer  as  the  original  query  in  any  permitted  state 
of  the  database.  This  guarantee  is  achieved  because  the  knowledge  used  to  transform  a  query  is  the 
same  knowledge  used  to  insure  the  semantic  integrity  of  the  data  stored  in  the  database.  Thus, 
semantic  query  optimization  brings  together  the  apparently  separate  research  areas  of  query 
processing  and  database  integrity. 

The  thesis  also  addresses  an  important  issue  in  current  automatic  planning  research:  production 
not just of a correct solution but of a “good” one, by means of an efficient problem solver. Semantic
query  optimization  advances  the  notion  of  a  problem  reformulation  step  for  problem-solving 
programs.  In  this  step,  equivalent  statements  of  the  original  problem  are  sought,  one  of  which  may 
have  a  better  solution  than  the  original  problem.  This  method  avoids  explicit  and  possibly  costly 
analysis  of  efficiency  factors  during  planning  itself. 

Semantic  query  optimization  can  also  be  viewed  as  one  aspect  of  intelligent  database  mediation.  It 
applies  knowledge  of  a  problem  domain  and  of  the  capabilities  and  limitations  of  the  database  to 
pose  the  most  effective  and  easily  processed  queries  to  solve  a  user’s  problem. 

The  thesis  formally  defines  transformations  that  preserve  semantic  equivalence  for  queries  in  the 
relational  calculus.  In  addition,  it  identifies  several  classes  of  cost-reducing  query  transformations  for 
relational  database  queries,  and  provides  quantitative  estimates  of  the  improvements  they  can 
produce,  based  upon  widely  accepted  models  of  query  processing. 

The  thesis  also  discusses  the  design  and  implementation  of  a  system  that  carries  out  semantic  query 
optimization for an important class of relational database queries. The system is called QUIST,
standing for QUery Improvement through Semantic Transformation.

The QUIST system has analyzed a range of queries for which different transformations apply. For
these queries, QUIST obtains substantial reductions in the cost of processing at a negligible cost for
the analysis itself.
Chapter 1
Introduction 


1.1  Overview  of  the  thesis 

The  problem  of  database  query  optimization  is  to  select  an  efficient  way  to  process  a  query 
expressed  in  logical  terms  from  among  the  alternative  ways  it  can  be  carried  out  in  the  physical 
database. This thesis presents a new approach to this problem, called semantic query optimization
(SQO). 


The  goal  of  semantic  query  optimization  is  to  produce  a  semantically  equivalent  query  that  is 
less  expensive  to  process  than  the  original  query. 

Semantic  query  optimization  is  a  response  to  inherent  limitations  in  what  may  be  termed 
conventional query optimization methods† ([Selinger79], [Yao79], [Youssefi78]). These methods seek to
exploit efficient paths in the physical database. However, it is not possible to supply physical support
for all logical relationships because of the high cost to maintain that support when the database is
updated. Thus, there will be many queries that both involve access to much data, and in which the
logical  relationships  are  not  well  supported  physically.  These  queries  are  expensive  to  process,  and 
conventional  techniques  are  ineffective. 

Semantic  query  optimization  actually  transforms  the  original  query  into  a  new  one  by  means  of  a 
process  of  inference.  The  transformations  are  limited  to  those  that  yield  a  semantically  equivalent 
query,  one  that  is  guaranteed  to  produce  the  same  answer  as  the  original  query  in  any  permitted  state 
of the database. This guarantee is achieved because the knowledge used to transform a query is the
same knowledge used to insure the semantic integrity [McLeod76] or meaningfulness of the data
stored  in  the  database.  Thus,  semantic  query  optimization  brings  together  the  apparently  separate 
research  areas  of  query  processing  and  database  integrity. 

SQO  also  addresses  an  important  issue  in  current  automatic  planning  research:  production  not 
just  of  a  correct  solution  but  of  a  “good”  one,  by  means  of  an  efficient  problem  solver. 


† The term optimization is a misnomer; there is no claim that the least expensive processing method is found. However, the
term  is  firmly  established  in  the  literature. 
Semantic query optimization advances the notion of a problem reformulation step for
problem-solving programs. In this step, equivalent statements of the original problem are
sought, one of which may have a better solution than the original problem. This method
avoids explicit and possibly costly analysis of efficiency factors during planning itself.

Semantic  query  optimization  can  also  be  viewed  as  one  aspect  of  intelligent  database  mediation.  It 
applies  knowledge  of  the  problem  domain  and  of  the  capabilities  and  limitations  of  the  database  to 
pose  the  most  effective  and  easily  processed  queries  to  solve  a  user’s  problem. 

As  with  most  query  optimization  work,  the  research  presented  here  deals  with  queries  using  the 
relational model of data ([Codd70], [Kim79]).† The thesis formally defines transformations that
preserve semantic equivalence for queries in the relational calculus [Codd71]. In addition, it
identifies  several  classes  of  cost-reducing  query  transformations  for  relational  database  queries,  and 
provides  quantitative  estimates  of  the  improvements  they  can  produce,  based  upon  widely  accepted 
models  of  query  processing. 

The thesis also discusses the design and implementation of a system that carries out semantic query
optimization  for  an  important  class  of  relational  database  queries.  The  system  is  called  QUIST, 
standing  for  QUery  Improvement  through  Semantic  Transformation. 

The  QUIST  system  has  analyzed  a  range  of  queries  for  which  different  transformations  apply  in 
the  context  of  a  simplified  query  processing  model  based  on  the  System  R  access  path  selector 
[Selinger79]. For these queries, QUIST's overhead is negligible compared to the estimated reduction
of query processing cost. The overhead is also negligible in cases where QUIST determines that there
are no constraint targets, or that the query conditions are not satisfiable. The latter condition is
detected without recourse to actual data, in contrast to a similar function performed by so-called
"cooperative response" systems ([Kaplan79], [Janas79]).

QUIST  uses  heuristics  to  guide  the  process  of  inference  that  produces  equivalent  queries.  The 
process  is  directed  toward  the  application  of  one  or  more  specific  types  of  transformations  on  the 
relational  query,  such  as  the  elimination  of  a  relation  or  the  introduction  of  a  constraint  on  an 
indexed  attribute.  The  only  inferences  that  take  place  are  those  that  may  produce  a  query  that  is 
more  efficient  to  process. 

QUIST's inference-guiding heuristics reflect the expert knowledge of relational database structure
and query processing developed in recent query optimization research. Indeed, it is the existence of
fairly wide agreement about models of query processing and optimization issues in the relational
setting that makes that setting a suitable one for exploring semantic query optimization.

The  operation  of  QUIST  can  be  contrasted  with  that  of  a  conventional  query  optimizer.  A 
conventional  optimizer  (Figure  1-1)  takes  the  given  query  as  its  input.  Its  output  is  a  plan  consisting 
of  a  sequence  of  retrieval  operations  in  the  physical  database. 


† The work is also applicable to other data models, particularly where implementations include some fast access paths.
Figure 1-1: Operation of a conventional query optimizer

In  operational  terms  (Figure  1-2),  QUIST encompasses: 

•  a  problem  reformulator

•  a  conventional  query  optimizer 

•  a  query  selector. 

QUIST  starts  with  the  constraints  specified  in  the  input  query.  The  problem  reformulator  (Figure 
1-3)  first  determines  which  database  relations,  if  any,  are  constraint  targets.  QUIST  designates  a 
relation  as  a  constraint  target  if  it  determines  that  it  may  lower  the  cost  of  query  processing  by  finding 
additional constraints on that relation. If there are no targets, QUIST merely returns the original
query  for  processing.  Otherwise,  its  problem  reformulator  next  repeats  a  cycle  of  operations  that 
produce  constraints  until  no  more  cycles  can  be  carried  out.  During  each  cycle,  relevant  semantic 
integrity  rules  are  retrieved.  QUIST  filters  the  rules  according  to  the  list  of  constraint  targets,  tests 
them  for  applicability  against  the  current  constraints,  and  asserts  new  constraints  if  possible.  The 
process  terminates  when  some  cycle  fails  to  generate  new  constraints.  Finally,  the  problem 
reformulator groups the known constraints, both those given originally in the query and those
derived using semantic knowledge of the domain, into a set of queries that are semantically
equivalent to the original query.
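The cycle of operations carried out by the problem reformulator can be sketched as a fixed-point loop. The sketch below is illustrative only and is not QUIST's implementation: the `Rule` class, the encoding of a constraint as a `(relation, attribute, operator, value)` tuple, and the two example rules about ships are all assumptions introduced here. Applicability is tested by exact matching of premises, whereas a real system would test logical implication between constraints. The final grouping of constraints into a set of equivalent queries is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical integrity rule: if all premises hold, the conclusion holds."""
    premises: frozenset
    conclusion: tuple  # (relation, attribute, operator, value)


def reformulate(query_constraints, rules, targets):
    """Repeat inference cycles until some cycle asserts no new constraint."""
    known = set(query_constraints)
    if not targets:
        # No constraint targets: the original query is returned unchanged.
        return known
    while True:
        derived = {
            rule.conclusion
            for rule in rules
            if rule.conclusion[0] in targets      # filter by constraint targets
            and rule.premises <= known            # applicability (exact match only)
            and rule.conclusion not in known      # keep only new constraints
        }
        if not derived:
            return known
        known |= derived


# Hypothetical rules, invented for illustration.
rules = [
    Rule(frozenset({("SHIPS", "Deadweight", ">", 150)}),
         ("SHIPS", "Type", "=", "Tanker")),
    Rule(frozenset({("SHIPS", "Type", "=", "Tanker")}),
         ("SHIPS", "Draft", ">", 10)),
]
query = {("SHIPS", "Deadweight", ">", 150)}
augmented = reformulate(query, rules, targets={"SHIPS"})
```

Here the second rule becomes applicable only after the first has fired, so the loop needs two productive cycles and a third, empty cycle before it terminates.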

QUIST  next  uses  its  conventional  query  optimizer  to  estimate  the  cost  of  processing  each  of  the 
semantically equivalent queries. Finally, as its output QUIST selects the query with the lowest
estimated  processing  cost  as  determined  by  the  conventional  query  optimizer. 
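The selection step amounts to taking a minimum over the candidate queries under the conventional optimizer's cost estimate. The following is a minimal sketch under stated assumptions: `toy_cost`, the cost figures, and the index on SHIPS.Type are hypothetical stand-ins for a real cost model, and queries are encoded as sets of `(relation, attribute, operator, value)` tuples purely for illustration.

```python
# Hypothetical: assume the physical database has an index on SHIPS.Type.
INDEXED = {("SHIPS", "Type")}


def toy_cost(query):
    """Toy estimate: 1000 units for a full scan, 100 if an indexed attribute is constrained."""
    uses_index = any((rel, attr) in INDEXED for rel, attr, _op, _val in query)
    return 100 if uses_index else 1000


def select_query(candidates, estimate_cost):
    """Return the semantically equivalent query with the lowest estimated cost."""
    return min(candidates, key=estimate_cost)


original = frozenset({("SHIPS", "Deadweight", ">", 150)})
with_type = original | {("SHIPS", "Type", "=", "Tanker")}
best = select_query([original, with_type], toy_cost)
```

Because the original query is always among the candidates, the selected query is never estimated to cost more than the original.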

In more abstract terms, QUIST operates at three levels that correspond to the levels of the
plan-generate-test paradigm of artificial intelligence [Feigenbaum71] seen in such systems as
Meta-Dendral [Buchanan76]. The planning and generating steps take place in the problem reformulator,
while the testing step is carried out by the conventional optimizer and the query selector.
Figure 1-2: Operation of the QUIST semantic query optimizer

•  The planning level is the one at which constraint targets are established. At this level, the
query is treated in very abstract terms. The query's only important characteristics are the
names of the relation attributes that are constrained or from which output values are
requested. At the planning level, QUIST divides the database's relations into those that
should be targets for inference and those that should not, much in the way that the
Dendral program [Feigenbaum71] divides fragments of chemical structures into those
that should or should not be part of a desired complete structure.

•  At the generate level, QUIST explores a space of semantically equivalent queries. Each
move in this space is a query transformation based upon the inference of an additional
constraint. Each inference is supported by a rule in the semantic knowledge base. Only
plausible moves are generated in the sense that the constraint target list permits only
transformations  which  may  possibly  produce  a  lower  cost  query.  The  representation  of 
the  query  at  the  generate  level  is  less  abstract  than  at  the  planning  level.  At  the  generate 
level,  it  is  necessary  to  deal  with  the  precise  constraints  on  database  attributes,  not  just 
with  the  names  of  the  attributes  that  have  been  constrained. 

•  The  testing  level  views  each  query  in  the  most  detail.  Here,  estimates  of  actual  processing 
cost are obtained. QUIST includes a conventional query optimizer to find the least
expensive  way  to  process  each  semantically  equivalent  query  produced  at  the  generate 
level.  The  sequence  in  which  each  part  of  the  query  is  processed  is  an  essential  factor. 
Processing  at  this  level  can  be  regarded  as  carrying  out  moves  in  a  space  of  physical 
realizations  of  a  single  logically  expressed  query  produced  at  the  level  above. 

QUIST's use of the query processing expertise developed in conventional query optimization
research is seen in the relationship between the searches at the generate and testing levels. The
testing  level  search  of  alternative  processing  sequences,  which  is  nothing  other  than  conventional 
query  optimization,  is  guided  by  detailed  models  of  the  cost  of  data  access.  The  generate  level  search 
of  semantically  equivalent  queries  is  guided  by  the  constraint  target  list.  The  heuristics  that  produce 
the  constraint  target  are,  in  effect,  summaries  or  abstract  versions  of  the  detailed  cost  models  used  at 
the  testing  level. 


1.2  Background  of  the  current  research

This  research  introduces  the  use  of  semantic  reasoning  to  address  the  problem  of  query 
optimization  in  relational  databases.  In  this  section,  we  briefly  review  the  research  on  database 
abstraction  that  has  focussed  attention  on  the  query  optimization  problem.  We  indicate  why 
relational databases are a suitable context for this current study, and we note how previous
investigators have defined the problem in that context. We also discuss the important ideas about the
semantic integrity of databases that suggest the possibility of semantic reasoning as an approach to
efficient query processing. The research discussed in this section serves to frame the issues of the
current  study.  We  defer  until  Chapter  6  a  discussion  of  the  significant  contributions  of  our  research 
in  the  context  of  previous  investigations. 


1.2.1  Database  abstraction  and  data  models 

The  query  optimization  problem  arises  from  the  distinction  between  the  logical  and  physical 
representations of a database. Fry and Sibley [Fry76] trace the evolution of database abstraction
concepts that has led to the current notion of a data model as the means to maintain the
logical/physical distinction. A data model is a language in which to express the logical structure of a
database and the logical operations that are permitted upon that structure. That is, a data model is a
vehicle for defining a database's data elements, relationships, and data types, as well as the operations
on the elements and relationships. It is also the basis for the query language by which elements of the
database  can  be  specified  and  retrieved  by  their  logical  properties. 

Most attention has been paid to three kinds of data models. These are the network model
[Taylor76], the hierarchical model [Tsichritzis76], and the relational model [Codd70]. Numerous
database systems and query languages have been based on each of these models (see [Wiederhold77]
for an extensive catalogue of such systems).


1.2.2  The  relational  model 

Whatever the merits of database systems based on the three major data models with respect to ease
of implementation, maintenance, and use, it has been persuasively argued [Date77] that the relational
model provides the greatest separation between logical and physical levels of a database. The degree
of data independence thus offered has stimulated much research into high level nonprocedural
facilities for retrieval and update, for the definition of logical and physical structures and user views,
and for the control of access, integrity, concurrency and recovery ([Kim79]); the current study follows
in that line of research.

For  simplicity,  a  relational  database  can  be  viewed  as  a  collection  of  tables  of  data.  In  this  view  the 
table columns are attributes and the rows correspond to individual data records. There are no explicit
connections  among  the  tables,  so  manipulations  of  them  can  be  specified  simply  and  flexibly.  One 
broad  class  of  relational  data  manipulation  languages  is  based  on  the  relational  algebra  [Codd70] 
which  defines  operators  to  transform  tables  into  other  tables.  Basic  operators  include  restriction 
(horizontal  subsetting  of  a  table),  projection  (vertical  subsetting  of  a  table),  and  join  (cross  matching 
of two tables). Another broad class of languages is based on the relational calculus [Codd71], an
applied  predicate  calculus. 
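The three basic operators can be illustrated with a minimal sketch in which a table is a list of rows and each row is a dictionary from attribute names to values. This representation is an assumption chosen for brevity, not part of the relational model. The OWNERS table and its Region values are hypothetical placeholders, and the join shown is an equality join on a single pair of attributes.

```python
def restrict(table, predicate):
    """Restriction: horizontal subsetting, keeping the rows that satisfy the predicate."""
    return [row for row in table if predicate(row)]


def project(table, attributes):
    """Projection: vertical subsetting, keeping the named columns and dropping duplicate rows."""
    seen, result = set(), []
    for row in table:
        key = tuple(row[a] for a in attributes)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result


def join(left, right, left_attr, right_attr):
    """Join: cross matching of two tables on equality of one attribute from each."""
    return [{**l, **r} for l in left for r in right if l[left_attr] == r[right_attr]]


ships = [
    {"Ship": "Bralanta", "Owner": "Braathan", "Type": "Tanker"},
    {"Ship": "Carlova", "Owner": "Index Maritime", "Type": "Bulk"},
]
# Hypothetical second table; the Region value is a placeholder.
owners = [{"Name": "Braathan", "Region": "R1"}]

tankers = restrict(ships, lambda row: row["Type"] == "Tanker")
types = project(ships, ["Type"])
matched = join(ships, owners, "Owner", "Name")
```

Each operator takes tables and returns a table, so the operators can be composed freely, which is the property the relational algebra relies on.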


1.2.3  Conventional  query  optimization

Research  in  conventional  query  optimization  is  important  to  the  present  work  for  two  reasons. 
First,  it  has  shown  that  separating  the  logical  and  physical  aspects  of  a  database  does  not  necessarily 
result in inefficient query processing. Secondly, it has established a body of knowledge about the
factors  that  govern  the  cost  of  processing  queries,  knowledge  that  can  be  directly  applied  in  a  system 
for  semantic  query  optimization. 

As we shall detail in Chapter 2, research in conventional query optimization has centered on
queries  built  up  from  the  basic  relational  algebra  operators  of  restriction,  projection,  and  join,  or 
from  their  equivalents  in  the  relational  calculus. 

The  starting  point  for  query  optimization  research  in  the  relational  context  was  the  analysis  of 
individual operations. Astrahan and Chamberlin [Astrahan75], among others, studied the restriction
operation. Gotlieb [Gotlieb75] and Rothnie [Rothnie75] are among those who investigated the join
operation. 
Query optimization research builds on studies of the simplest queries that involve all of the key
relational operations. These simple queries involve a single join operation between just two relations.
The most important of these studies, from the standpoint of establishing the necessary elements
underlying conventional query optimization, are those of Blasgen and Eswaren [Blasgen77] and Yao
and  DeJong  [Yao78].  These  studies  produced  sets  of  query  processing  methods  for  the  simple 
queries,  along  with  cost  formulas  and  applicability  conditions  for  the  methods. 

Most recently has come the development of optimizers for the general class of restriction-join-projection
queries. The building blocks of these optimizers are the methods to process the single-join
queries.  The  key  insight  underlying  the  optimizers  is  to  see  the  evaluation  of  the  general  query  in 
terms  of  a  sequence  of  evaluations  of  simple  queries.  The  task  of  the  general  optimizer  is  to  choose 
which  simple  methods  to  apply,  and  in  what  sequence.  The  query  optimizers  of  INGRES 
[Youssefi78] and of System R [Selinger79] can be viewed as operating in this manner.
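The idea of choosing a sequence of simple steps can be illustrated with a deliberately crude sketch. It is not the algorithm of INGRES or of System R: it enumerates every join order exhaustively, and both the cost measure (the sum of estimated intermediate result sizes) and the single uniform selectivity are assumptions made only for this example. The relation names and cardinalities are hypothetical.

```python
from itertools import permutations


def plan_cost(order, cardinality, selectivity):
    """Sum of estimated intermediate result sizes when joining relations in the given order."""
    size = cardinality[order[0]]
    total = 0
    for name in order[1:]:
        # Each step is a single join between the running result and one more relation.
        size = size * cardinality[name] * selectivity
        total += size
    return total


def best_order(cardinality, selectivity):
    """Exhaustively pick the join order with the lowest toy cost."""
    return min(permutations(cardinality),
               key=lambda order: plan_cost(order, cardinality, selectivity))


# Hypothetical relations and sizes.
cardinality = {"A": 1000, "B": 10, "C": 100}
order = best_order(cardinality, selectivity=0.01)
```

Under this toy model the cheapest orders join the two smallest relations first, since that keeps the first intermediate result small.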


1.2.4  Semantic  integrity  of  databases

The  idea  of  semantic  query  optimization  presented  in  this  thesis  rests  squarely  on  the  concept  that 
a  database  should  be  an  accurate  reflection  of  some  real  world  application,  not  just  any  collection  of 
data values. If the database contains values that cannot be attained in its real world application, then
there is said to be a violation of the semantic integrity of the database. Semantic query optimization
relies on a knowledge base of rules that are not part of the database proper, but that describe what
values  in  the  database  correspond  to  possible  states  of  the  real  world  application. 

In  early  database  systems,  integrity  checks  were  confined  to  the  detection  of  errors  in  format  or 
were  implemented  as  ad-hoc  procedures  incorporated  in  general  database  updating  routines.  A  more 
systematic  approach  to  the  classification,  detection  and  treatment  of  semantic  integrity  violations 
arose in the work of such researchers as Eswaren and Chamberlin [Eswaren75] and Hammer and
McLeod [Hammer75]. Two broad notions of semantic integrity have been developed. One notion
concerns the specification of permissible states of the database. For instance, it may be required that
the salary of employees be no greater than some maximum figure; any data value for salary that
exceeds that maximum does not reflect a legitimate state of affairs in the company, and so must be
considered a semantic integrity violation. The other notion of semantic integrity concerns
permissible transitions from one state to another. For instance, it may not be permissible to reduce
the salary of an employee, even if the salaries before and after the change are both legitimate salaries.
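The difference between the two notions can be made concrete with a small sketch of the salary example. The maximum figure and the function names are hypothetical, introduced only for illustration. A state constraint inspects a single state of the data, while a transition constraint needs both the state before and the state after a change.

```python
MAX_SALARY = 100_000  # hypothetical maximum salary


def valid_state(employee):
    """State constraint: a salary may not exceed the maximum."""
    return employee["salary"] <= MAX_SALARY


def valid_transition(before, after):
    """Transition constraint: a salary may not be reduced, even between two legitimate states."""
    return after["salary"] >= before["salary"]


old = {"name": "E1", "salary": 50_000}
new = {"name": "E1", "salary": 40_000}

both_states_ok = valid_state(old) and valid_state(new)
change_ok = valid_transition(old, new)
```

In this example both salaries are legitimate states, yet the change from one to the other is not a permissible transition.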

Because  query  processing  is  assumed  to  take  place  in  a  single  state  of  the  database,  we  are  only 
concerned in the present research with semantically based constraints on states of the database, rather
than with permissible transitions between states. Several methods have been suggested for expressing
such state constraints including: as qualifications in a query language expression [Stonebraker75]; in
a special constraint language [McLeod76]; in terms of an algebra in the spirit of abstract data types
[Brodie78]; and in a general logical formalism such as predicate calculus [Chang78] or semantic
networks [Roussopoulos77]. The research presented here generally adopts the predicate calculus
approach to representing semantic integrity constraints adopted by Chang [Chang78] and others
[Gallaire78].

In addition to studies in the representation of semantic constraints on databases, much effort has
been devoted to issues of designing systems for specifying semantic integrity constraints and for
checking them efficiently ([Hammer78], [Wilson80]). The method devised by Stonebraker
[Stonebraker75] for maintaining semantic integrity of a database is similar to the method used by
QUIST. Stonebraker's method, called query modification, works by modifying an update request. In
general terms, it conjoins appropriate integrity constraints to the qualification portion of the update
request. In this way, no data is altered that would result in a state that violates the conjoined
constraint. The query transformations described in the present research are similar to these query
modifications. 
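The effect of conjoining a constraint to an update's qualification can be sketched as follows. This is a simplified illustration in the spirit of query modification, not Stonebraker's mechanism, which rewrites the text of the update request itself. The function `apply_update`, the employee rows, and the salary limit are hypothetical.

```python
def apply_update(table, qualification, change, constraints):
    """Change a row only if it qualifies AND the changed row satisfies every constraint."""
    result = []
    for row in table:
        candidate = change(row)
        # The integrity constraints are conjoined to the original qualification.
        if qualification(row) and all(check(candidate) for check in constraints):
            result.append(candidate)
        else:
            result.append(row)
    return result


employees = [
    {"name": "E1", "dept": "toy", "salary": 500},
    {"name": "E2", "dept": "toy", "salary": 950},
    {"name": "E3", "dept": "shoe", "salary": 500},
]

updated = apply_update(
    employees,
    qualification=lambda row: row["dept"] == "toy",
    change=lambda row: {**row, "salary": row["salary"] + 100},
    constraints=[lambda row: row["salary"] <= 1000],  # hypothetical salary limit
)
```

The second employee qualifies for the raise but is left unchanged, because the raised salary would violate the conjoined constraint.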

We  have  now  concluded  our  brief  review  of  research  that  forms  the  background  to  our 
investigation  of  semantic  query  optimization.  We  shall  look  at  additional  related  research  when  we 
discuss  the  significance  of  our  results  in  Chapter  6. 

1.3  Guide  to  reading 

Semantic  query  optimization  integrates  two  important  sources  of  knowledge:  knowledge  about 
cost  factors  in  query  processing,  and  knowledge  about  the  semantics  of  the  application  task  domain. 
Chapter  2  discusses  the  problem  definitions  and  models  of  query  processing  that  characterize 
conventional  approaches  to  the  optimization  of  queries  in  relational  databases.  Chapter  3  introduces 
semantic  query  optimization.  It  presents  the  formal  basis  for  the  notion  of  a  transformation  of  a 
relational  query  that  preserves  meaning  in  all  permitted  states  of  the  database. 

In  Chapter  4,  we  describe  the  QUIST  system  in  detail.  We  show  how  the  models  of  query 
processing developed in research on relational databases are directly incorporated into heuristics to
guide  the  transformation  of  queries  into  less  costly,  semantically  equivalent  forms.  In  Chapter  5,  we 
discuss  the  effectiveness  of  QUIST  in  terms  of  the  estimated  reductions  in  cost  made  possible  by 
various  kinds  of  query  transformations.  We  also  report  the  results  of  using  QUIST  on  a  range  of 
queries  that  illustrate  those  classes  of  transformations,  and  we  discuss  the  stability  of  the  QUIST 
control  strategy  when  the  size  of  the  database  or  rule  base  increases. 

Finally,  in  Chapter  6  we  discuss  the  significance  of  the  work  reported  here  in  the  context  of 
research  in  database  management  and  artificial  intelligence.  We  also  review  the  limitations  of  the 
current  work  and  suggest  directions  for  future  research. 
Chapter 2

Query  Processing  Expertise 


The most important outcome of conventional query optimization research† is deeper
understanding of the problem of accessing data, what we might call query processing expertise. In the
terminology of Section 1.1, query processing expertise is knowledge of problem-solving methods for
the specific problem of processing database queries. This expertise is manifested in two ways: in
terms of the assumptions, models, and approaches that characterize advanced query optimization
systems, and in terms of generalizations concerning the factors that contribute to the cost of
processing  queries. 

Research  in  the  optimization  of  relational  database  queries  serves  as  an  appropriate  case  study  for 
examining  query  processing  expertise.  There  is  substantial  agreement  on  the  validity  and  power  of 
the  data  storage,  data  access,  and  cost  models  developed  in  the  relational  context,  although  there  is  no 
standard  accepted  for  all  systems. 

Query  processing  expertise  is  an  essential  underpinning  of  semantic  query  optimization.  Through 
the  proper  use  of  this  knowledge,  it  is  possible  to  control  the  use  of  semantic  knowledge  in  an 
effective  semantic  query  optimization  system.  In  this  chapter,  therefore,  it  is  our  aim  to  summarize 
the  expertise  that  has  emerged  from  research  on  relational  database  query  processing.  In  so  doing, 
we  specify  the  class  of  queries  towards  which  we  have  directed  our  specific  research  in  semantic 
query  optimization. 

In Section 2.1, we review the basic terminology of relational databases. In Section 2.2, we describe
how queries are specified. Our objective in Section 2.3 is to specify the knowledge that underlies
conventional approaches to optimization of restrict-join-project queries. This is the class of queries
that is the focus of conventional query optimization research. We accomplish this objective through a
detailed review of some characteristic research work in the field. We extend this in Section 2.4 to
show how this query processing knowledge is actually used in a conventional query optimizer.
Finally, in Section 2.5 we make explicit some of the generalizations about query processing that
constitute  query  processing  expertise  and  that  play  an  integral  part  of  semantic  query  optimization. 


† The use of the term optimization in this context is discussed in Section 1.1.
2.1 Relational databases

In formal terms, a relational database is a collection of relations. Let $D_1, D_2, \ldots, D_n$ be n sets. Codd
[Codd71] defines the extended Cartesian product of these sets as:

$$X(D_1, D_2, \ldots, D_n) = \{(d_1, d_2, \ldots, d_n) : d_j \in D_j \text{ for } j = 1, 2, \ldots, n\}.$$

R is a relation on the sets $D_1, D_2, \ldots, D_n$ if it is a subset of the extended Cartesian product of those sets.

A relation is therefore a set, and its members are n-tuples (or more simply, tuples) where n is referred
to as the degree of the relation. The sets $D_i$ on which the relation R is defined are called its
underlying domains. For purposes of modelling databases, the domains under consideration are
integers and character strings. The number of tuples in R is the cardinality
of R.
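These definitions can be checked directly on small finite sets. The sketch below is illustrative only: the two domains are tiny hypothetical subsets of the character strings and the integers, and the function names are introduced here for the example.

```python
from itertools import product


def extended_cartesian_product(*domains):
    """All n-tuples whose j-th component is drawn from the j-th domain."""
    return set(product(*domains))


def is_relation(candidate, *domains):
    """A relation on the domains is any subset of their extended Cartesian product."""
    return set(candidate) <= extended_cartesian_product(*domains)


# Hypothetical finite domains, chosen small enough to enumerate.
types = {"Tanker", "Bulk"}
drafts = {6, 9, 12}

R = {("Tanker", 9), ("Bulk", 12)}
degree = 2             # number of underlying domains, so each member is a 2-tuple
cardinality = len(R)   # number of tuples in R
```

The full product here has six tuples, and R is one of its many subsets; a set containing a tuple such as ("Tanker", 99) would not be a relation on these domains.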

A  relation  can  be  viewed  as  a  table  in  which  the  rows  correspond  to  tuples  and  the  columns 
correspond to mappings from the relation into its domains. The mappings are called attributes, and
it  is  possible  to  base  more  than  one  attribute  on  the  same  domain.  An  attribute  value  is  the  entry  for 
a  particular  row/column  combination. 

To  illustrate  the  data  definitions,  consider  a  relation  that  describes  characteristics  of  ships: 

SHIPS(Ship  Owner  Type  Length  Draft  Deadweight)

SHIPS  is  the  name  of  a  relation.  The  words  in  parentheses  are  the  names  of  the  attributes  of  the 
relation. 

Assume that the attributes Ship, Owner, and Type are defined on character strings, and that the other
attributes take on integer values. The relation might consist of the following tuples (based on data
from Lloyd's Register of Ships [Lloyds78]):


Ship                Owner              Type      Length  Draft  Deadweight

"Bralanta"          "Braathan"         "Tanker"     285     17         154
"British Wye"       "BP Shipping"      "Tanker"     171      9          25
"Carlova"           "Index Maritime"   "Bulk"       218     12          55
"George F. Getty"   "Hemisphere"       "Tanker"     319     19         227
"Intellect Energy"  "Energy Shipping"  "Tanker"      88      6           2

Figure 2-1: Illustrative tuples in the SHIPS relation

We take the first tuple to signify that the tanker Bralanta is owned by Braathan, that it is 285
meters long, has a draft of 17 meters, and has a deadweight (size) of 154,000 long tons. The other
tuples are interpreted similarly.

2.2  The  specification  of  relational  database  queries 

There are two broad classes of relational query languages: those based upon the relational algebra
and those based upon the relational calculus. An extensive discussion and comparison of the two
classes can be found in [Date77]. In a relational algebra language, a query is expressed by specifying
operators that transform relations into other relations, and ultimately into the desired result relation.
The relational calculus is an applied predicate calculus. Query languages based on the relational
calculus specify retrieval in terms of a calculus expression. Pirotte [Pirotte78] gives an excellent
survey of the kinds of relational query languages that are based on the predicate calculus.

The  languages  based  upon  the  relational  algebra  and  the  languages  based  upon  the  relational 
calculus  present  different  interfaces  to  a  user  or  a  program.  However,  Codd  demonstrated  [Codd71] 
that  the  two  formalisms  are  equivalent.  In  that  same  paper,  Codd  proposed  the  relational  calculus  as 
a  standard  against  which  the  expressive  power  of  query  languages  could  be  measured. 

We now illustrate the specification of relational queries using a language based on the relational
calculus, the SODA language [Moore79]. We note that a query does three things:

1. It specifies what relations are involved in the query, either for checking conditions or for
retrieving specific values.

2. It specifies what conditions must be met.

3. It specifies what aspects of the qualifying data items are to be retrieved.

Our  illustration  uses  a  simple  example  relational  database  that  includes  two  relations: 

SHIPS(Ship, Length)

CARGOES(Ship,  Cargotype,  Quantity) 

These relations contain information about ships and the cargoes they carry. Suppose it is desired
to retrieve the names of ships longer than 200 meters that are carrying more than 1000 tons of wheat.
The appropriate query in SODA could be expressed as:

(IN V1 SHIPS) (IN V2 CARGOES)

((V1 Ship) = (V2 Ship))

((V1 Length) > 200) ((V2 Cargotype) = "Wheat") ((V2 Quantity) > 1000)

(? (V1 Ship))

The first line defines the ranges of two tuple variables V1 and V2. A tuple variable is a variable
that ranges over the tuples of a specified relation. For each tuple variable, the query specifies the
relation over which the variable ranges. Thus, V1 ranges over the SHIPS relation, and V2 ranges over
the CARGOES relation.

The  next  two  lines  of  the  query  specify  the  retrieval  qualification.  What  are  the  objects  on  which 
the  qualification  is  to  be  tested?  They  are  precisely  the  tuples  of  the  relation  formed  by  taking  the 
Cartesian  product  of  the  relations  over  which  the  tuple  variables  range.  The  formation  of  the 
Cartesian  product  is  conceptual.  That  is,  it  may  not  actually  be  necessary  to  form  the  product 
completely  before  applying  qualifying  conditions.  Indeed,  it  is  advisable  on  efficiency  grounds  not  to 
form the product. Each tuple in the Cartesian product of the example query is the concatenation of a
SHIPS tuple with a CARGOES tuple.

The first line of the qualification contains the join term ((V1 Ship) = (V2 Ship)). A join term has
the form (X COMP Y) where X and Y are attribute specifiers, and COMP is one of the comparison
operators such as =, >, and so forth. An attribute specifier is a (tuple variable, attribute name)
pair; it is the same thing as the indexed tuple referred to in [Codd71], but restricted to a single relation
and a single attribute. The attributes specified by X and Y must be defined on the same underlying
domain. Roughly speaking, the join term of our example query pairs each cargo with the ship that is
carrying it by equating the names of ships.

The next line contains three restriction terms, ((V1 Length) > 200), ((V2 Cargotype) = "Wheat"),
and ((V2 Quantity) > 1000), that further restrict the subset of the Cartesian product that passes the
join term test. A restriction term is of the form (X COMP CONSTANT) where X and COMP are as
before and CONSTANT is a constant in the domain of the attribute specified by X. In our example,
the SHIPS portion of qualifying Cartesian product tuples must have a Length value greater than 200,
and the CARGOES portion must have a Cargotype value of "Wheat" and a Quantity value greater
than 1000. Note that there is an implicit conjunction among the join and restriction terms.

The tuples of the specified Cartesian product relation that satisfy the retrieval qualification are
called the qualifying tuples. The final task of the query is to say what information is sought from the
qualifying tuples. The desired output is specified in a target list. A target list is a list of attribute
specifiers. Each attribute specifier in the target list requests the retrieval of the value of the specified
attribute for all qualifying tuples. The final line of the example query specifies the retrieval of the
ship name Ship from the SHIPS portion of the qualifying tuples.

In a standard relational calculus language, a retrieval qualification is a logical combination of
restriction terms and join terms, using the standard logical connectives and existential and universal
quantification. The reader is referred to [Codd71] or [Pirotte78] for a thorough discussion of
allowable  qualifications.  Intuitively,  the  join  terms  of  a  qualification  most  often  correspond  to 
semantic  relationships  among  entities  and  the  restriction  terms  are  additional  restrictions  on  the 
entities  so  related. 
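
The conceptual reading of the example query in this section (range statements, then a qualification tested over the Cartesian product, then a target list) can be sketched directly. This is a deliberately naive evaluation, written only to mirror the definitions; it is not how SODA or any real system processes the query. The SHIPS lengths are those of Figure 2-1, while the CARGOES rows are invented for the illustration.

```python
from itertools import product

SHIPS = [
    {"Ship": "Bralanta", "Length": 285},
    {"Ship": "Carlova", "Length": 218},
    {"Ship": "British Wye", "Length": 171},
]
CARGOES = [  # hypothetical rows, invented for this sketch
    {"Ship": "Bralanta", "Cargotype": "Wheat", "Quantity": 5000},
    {"Ship": "Carlova", "Cargotype": "Wheat", "Quantity": 800},
    {"Ship": "British Wye", "Cargotype": "Wheat", "Quantity": 2000},
]

def run_example_query(ships, cargoes):
    """Evaluate the example query by enumerating the conceptual Cartesian product."""
    result = []
    # Range statements: V1 ranges over SHIPS and V2 over CARGOES.
    for v1, v2 in product(ships, cargoes):
        join_term = v1["Ship"] == v2["Ship"]
        restriction_terms = (
            v1["Length"] > 200
            and v2["Cargotype"] == "Wheat"
            and v2["Quantity"] > 1000
        )
        # The join and restriction terms are implicitly conjoined.
        if join_term and restriction_terms:
            result.append(v1["Ship"])  # target list: (V1 Ship)
    return result
```

With the rows above, only Bralanta satisfies every term: Carlova carries too little wheat and British Wye is too short. Forming the full product, as this sketch does, is exactly what an efficient processor avoids.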

2.3  Optimization  of  simple  restrict-join-project  queries 

Experience  with  relational  database  systems  indicates  that  there  is  a  subset  of  the  relational  algebra 
in  which  a  very  great  percentage  of  queries  can  be  expressed.  This  subset  has  therefore  become  the 
main  focus  of  conventional  query  optimization  research  in  relational  databases. 

In this section, we look in detail at a characteristic example of this research, the work of Blasgen
and Eswaren [Blasgen77] at IBM. Our purpose is to reveal the foundation elements of conventional
query optimization that have been generally accepted by investigators in the field. These elements
are noted in Figure 2-2.

1.  A  limited  but  important  class  of  queries. 

2.  A  model  of  data  storage. 

3.  A  model  of  access  to  data. 

4.  A  cost  measure  related  to  data  access. 

5. A set of methods to carry out the "atomic" queries.

6.  System  and  query  parameters  that  are  used  in  cost  analysis. 

7.  Cost  formulas  and  applicability  conditions  for  the  methods. 

Figure 2-2: Elements of conventional query optimization

The queries under consideration are those that can be expressed in terms of the three basic
relational algebra operations: restriction, which selects rows from a relation; projection, which selects
columns from a relation; and join, which matches (cross-references) two relations on compatible
attributes (attributes defined on the same underlying domain of values). Note that the discussion
applies to the relational calculus too, because a corresponding class of queries in that formalism can be
translated into these algebraic terms (see [Yao79], for example). The corresponding class in relational
calculus terms involves range statements for tuple variables, plus a qualification in terms of those
variables, which together correspond to restriction and join, and a target clause, which corresponds to
a projection.

The work reported in [Blasgen77] considers a general query with the following form: apply a given
restriction to relation R, yielding R', and apply a possibly different restriction to relation S, yielding S'.
Join R' and S' to form a (new) relation T, and project some columns from T. This query can be termed
a two-relation query, and can be viewed as the atomic unit from which all the restrict-join-project
queries can be constructed.

Blasgen and Eswaren propose straightforward models of access and storage. Because these models
are  typical  of  the  access  and  storage  models  used  throughout  conventional  query  optimization 
research,  we  will  describe  them  in  detail. 

The  database  is  assumed  to  be  stored  on  direct  access  secondary  storage,  typically  on  disk.  Physical 
storage  is  divided  into  fixed  length  pages.  There  are  two  kinds  of  pages,  data  pages  and  index  pages. 
The  tuples  of  the  database  relations  are  stored  as  fixed  length  records  on  the  data  pages  (under  this 
assumption,  the  terms  tuple  and  record  are  used  interchangeably  throughout  the  query  optimization 
literature).  A  data  page  may  contain  tuples  from  more  than  one  relation,  but  no  tuple  is  broken  up 
across page boundaries. Each tuple has a unique tuple identifier (TID). It is assumed that the file
system can convert a TID into an address for direct access to a tuple. The TIDs have the property
that accessing a set of tuples in sorted TID order accesses a data page at most once.

Secondary  storage  is  divided  into  segments.  A  segment  is  a  large  address  space  that  contains  one  or 
more  relations.  It  is  implemented  as  a  set  of  pages.  Each  tuple  stored  in  a  segment  identifies  the 
relation  to  which  it  belongs.  No  relation  is  broken  up  across  segment  boundaries.  To  obtain  all  the 
tuples  of  a  relation,  the  segment  can  be  scanned  by  fetching  its  pages  one  at  a  time  and  checking 
every  tuple  on  the  page  for  membership  in  the  desired  relation.  This  kind  of  scan  is  called  a  segment 
scan.  A  segment  scan  fetches  every  page  in  the  segment  once. 

Because  a  segment  can  be  large,  a  segment  scan  can  be  very  slow.  For  this  reason,  other  access 
paths to the tuples of a relation may be arranged. The model described in [Blasgen77] admits an
access path based on a single-column index to a relation (another type of access path is the link). A
single-column index on a column A of a relation R is a set of pairs whose first component is a value
from A and whose second component is the TID of a tuple of R that has that value. The index is
stored as a B-tree of pages. Pages at the lowest level contain the actual (key, TID) pairs sorted by key.
Higher levels contain pointers to lower level pages. Because of the B-tree organization, the index
permits  rapid  access  to  a  single  tuple  with  a  desired  value.  The  number  of  index  pages  to  be  fetched 
equals  the  height  of  the  tree.  Also,  the  lowest  level  index  pages  are  linked  so  that  all  the  tuples  or  any 
key  subsequence  of  them  can  readily  be  retrieved  in  sorted  key  order  by  scanning  the  leaf  nodes  of 
the  index.  This  operation  is  called  an  index  scan. 
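
The behavior of a single-column index can be sketched with a sorted list standing in for the leaf level of the B-tree. This is an assumption made for brevity: the class name and methods below are hypothetical, binary search replaces the descent through the higher tree levels, and pages are not modeled at all.

```python
from bisect import bisect_left, bisect_right

class SingleColumnIndex:
    """A set of (key, TID) pairs kept in key order, like the leaf level of the B-tree."""

    def __init__(self, pairs):
        self.pairs = sorted(pairs)
        self.keys = [key for key, _ in self.pairs]

    def lookup(self, value):
        """Return the TIDs of all tuples whose indexed column equals value."""
        return self.index_scan(value, value)

    def index_scan(self, low, high):
        """Return TIDs for the key subsequence low <= key <= high, in sorted key order."""
        start = bisect_left(self.keys, low)    # first position with key >= low
        stop = bisect_right(self.keys, high)   # first position with key > high
        return [tid for _, tid in self.pairs[start:stop]]
```

For example, an index built from the pairs (285, 0), (171, 1), (218, 2), (319, 3), (88, 4) would answer a scan of the key range 200 to 300 with the TIDs 2 and 0, in that order, because the keys 218 and 285 are visited in sorted order.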

The  usefulness  of  an  index  for  query  evaluation  depends  upon  whether  the  relation  is  clustered 
with  respect  to  the  index  and  on  the  number  of  tuples  to  be  retrieved.  A  relation  is  clustered  with 
respect  to  an  index  if  the  tuples  of  the  relation  are  stored  in  the  same  sequence  as  the  key  sequence 
given  by  the  index.  With  such  an  arrangement,  if  the  index  is  used  to  access  tuples  of  the  relation 
then each data page of the relation will be fetched only once. On the other hand, if the relation is
not clustered with respect to the index, it is assumed that each access of a tuple using the index
requires fetching a new data page.

Besides  segment  scan  and  index  scan,  the  access  model  includes  sorting  tuples  on  the  value  of  some 
column.  It  is  assumed  that  the  files  are  large  enough  to  require  an  external  Z-way  sort-merge. 

Based on the access and storage models, Blasgen and Eswaren develop methods to carry out the
two-relation query. The methods differ in their use of TIDs, indexes, and sorting, and some of them
are only applicable if certain indexes exist. Two of the methods will be described.

The  first  method  is  the  nested  loops  method.  The  first  relation  is  scanned.  For  each  tuple  that 
meets  the  restrictions  on  that  relation,  a  scan  of  the  second  relation  is  performed.  Some  tuples  from 
the  second  relation  may  be  found  that  meet  the  restrictions  on  the  second  relation  and  that  have  the 
same  join  column  value  as  the  current  first  relation  tuple.  Each  qualifying  second  relation  tuple  is 
combined  with  the  current  first  relation  tuple  to  form  a  composite  result  tuple  (projecting  out  the 
desired  columns). 
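
The nested loops method described above can be sketched as follows. The sketch works on in-memory lists of dictionaries rather than on pages, and the function and parameter names are assumptions chosen for the illustration.

```python
def nested_loops_join(first, second, first_pred, second_pred,
                      first_col, second_col, project):
    """Join two relations by rescanning the second for each qualifying first tuple."""
    result = []
    for r in first:                          # single scan of the first relation
        if not first_pred(r):
            continue                         # r fails the restrictions on the first relation
        for s in second:                     # one scan of the second relation per qualifying r
            if second_pred(s) and s[second_col] == r[first_col]:
                result.append(project(r, s)) # composite tuple, desired columns only
    return result
```

The inner loop makes the cost structure visible: the second relation is scanned once for every first relation tuple that passes its restrictions.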

The  second  method  is  the  merging  scans  method.  Both  relations  must  be  scanned  in  join  column 
order.  Either  relation  that  is  not  indexed  on  its  join  column  must  be  sorted  into  a  temporary  file  that 
is  ordered  on  that  column.  The  first  relation  is  scanned  in  join  column  order.  For  each  first  relation 
tuple  that  meets  the  restrictions,  the  second  relation  is  scanned.  However,  because  of  join  order 
sequencing,  it  is  possible  to  keep  track  of  the  current  position  of  the  two  scans  and  never  rescan  any 
portion  of  either  relation  once  the  current  join  column  value  exceeds  the  value  in  that  portion.  This 
bookkeeping  also  makes  it  possible  to  spot  situations  where  tuples  in  one  relation  have  no  join 
partners  in  the  other  relation. 
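
A minimal sketch of the merging scans method follows. It simplifies the description above in two labeled ways: an in-memory sort stands in for both the index scan and the sorted temporary file, and the restrictions are applied before ordering rather than during the scan. The names are hypothetical.

```python
def merging_scans_join(first, second, first_pred, second_pred,
                       first_col, second_col, project):
    """Join two relations by advancing two scans that are ordered on the join column."""
    r_sorted = sorted((t for t in first if first_pred(t)), key=lambda t: t[first_col])
    s_sorted = sorted((t for t in second if second_pred(t)), key=lambda t: t[second_col])
    result = []
    i = j = 0
    while i < len(r_sorted) and j < len(s_sorted):
        r_val = r_sorted[i][first_col]
        s_val = s_sorted[j][second_col]
        if r_val < s_val:
            i += 1            # this first relation tuple has no join partner
        elif r_val > s_val:
            j += 1            # this second relation tuple has no join partner
        else:
            # Find the run of second relation tuples that share the current value.
            j_end = j
            while j_end < len(s_sorted) and s_sorted[j_end][second_col] == r_val:
                j_end += 1
            # Pair every first relation tuple with that value against the run.
            while i < len(r_sorted) and r_sorted[i][first_col] == r_val:
                for k in range(j, j_end):
                    result.append(project(r_sorted[i], s_sorted[k]))
                i += 1
            j = j_end         # never revisit tuples below the current join value
    return result
```

Neither position ever moves backward past a join value that has been exceeded, which is the bookkeeping that distinguishes this method from nested loops.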

The cost measure for the methods is the number of pages that must be brought in from secondary
storage. This is a reasonable assumption if it is believed that input/output time dominates processor
time. Most (though not all) query optimization models make this assumption.

Given  a  set  of  methods  and  a  cost  measure,  it  is  possible  to  develop  cost  formulas  for  the  methods. 
The formulas depend upon system parameters that are database-dependent but independent of the
specific  query,  and  upon  other  query-dependent  parameters.  The  cost  formulas  for  a  method  to 
process  a  complete  two-relation  query  are  built  from  cost  formulas  for  scanning  a  single  relation.  For 
a  segment  scan,  the  cost  is  the  number  of  pages  in  the  segment  that  contains  the  relation.  This  is 
obviously  a  system  parameter,  not  related  to  the  restrictions  or  other  aspects  of  the  query. 

Unlike the cost of a segment scan, the cost of an index scan depends both on system parameters and
on query parameters. To see this, consider a scan of a key subsequence of column A of relation R using
index I. Suppose the scan starts at column value V1 and ends at column value V2. That is, the aim is to
retrieve all tuples of relation R that have a value for column A that is greater than or equal to V1 and
less than or equal to V2.

The first step of the scan is to locate the first tuple with a value between V1 and V2. The index
permits rapid access to that tuple, at the cost of fetching a number of index pages equal to the height
of the tree in which the index is stored; index tree height is clearly a system parameter. The rest of
the operation consists of chaining along the leaf index pages until the key value exceeds V2. At each
index page, TIDs for qualifying tuples are found and their data pages are fetched. The number of
leaf index pages and the number of data pages fetched depend upon the number of tuples that have
a value between V1 and V2 for column A. This obviously depends upon the query because V1 and
V2 are specified by the query.
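
The dependence on both kinds of parameters can be illustrated with a rough page-count estimate. This is not the formula of [Blasgen77]; it is a sketch under stated assumptions: the parameters entries_per_leaf_page and tuples_per_data_page are introduced here, qualifying leaf entries are assumed to be packed contiguously, and the clustered and unclustered cases follow the fetching assumptions described earlier in this section.

```python
from math import ceil

def index_scan_cost(tree_height, qualifying_tuples,
                    entries_per_leaf_page, tuples_per_data_page, clustered):
    """Estimate pages fetched by an index scan over a key subsequence (illustrative only)."""
    # Descending the tree costs tree_height index pages and reaches the first leaf.
    leaf_pages = max(1, ceil(qualifying_tuples / entries_per_leaf_page))
    index_pages = tree_height + (leaf_pages - 1)   # further leaves reached by chaining
    if clustered:
        # Tuples are stored in key order, so each data page is fetched once.
        data_pages = ceil(qualifying_tuples / tuples_per_data_page)
    else:
        # Each tuple access is assumed to require fetching a new data page.
        data_pages = qualifying_tuples
    return index_pages + data_pages
```

Tree height and page capacities are system parameters, while qualifying_tuples is fixed by the values V1 and V2 in the query. Under these assumptions, 1000 qualifying tuples with a tree of height 3, 200 entries per leaf page, and 50 tuples per data page cost 27 page fetches through a clustered index and 1007 through an unclustered one.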

It  is  straightforward  to  combine  scan  cost  formulas  into  cost  formulas  for  a  complete  two-relation 
query  processing  method.  For  example,  the  nested  loops  method  consists  of  a  scan  of  one  relation 
and  for  each  of  its  qualifying  tuples,  a  scan  of  the  second  relation.  Hence,  the  cost  of  the  nested  loops 
method  is  the  cost  of  scanning  the  first  relation  plus  the  product  of  the  number  of  qualifying  first 
relation  tuples  with  the  cost  of  scanning  the  second  relation. 
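
Stated as a function, the nested loops cost just described is a one-line combination of scan costs. The function and parameter names are assumptions for the illustration; the scan costs themselves would come from the segment scan or index scan formulas.

```python
def nested_loops_cost(first_scan_cost, qualifying_first_tuples, second_scan_cost):
    """Pages fetched: one scan of the first relation, plus one scan of the
    second relation for each qualifying tuple of the first."""
    return first_scan_cost + qualifying_first_tuples * second_scan_cost
```

For instance, if scanning the first relation costs 100 pages, 10 of its tuples qualify, and each scan of the second relation costs 50 pages, the method costs 100 + 10 * 50 = 600 page fetches. The product term shows why stronger restrictions on the first relation reduce the cost so sharply.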

The work of S.B. Yao and his associates presented in [Yao78] and [Yao79] rests on the same
elements as the work of Blasgen and Eswaren. In particular, Yao's work addresses the same class of
two-relation restrict-join-project queries and presents similar storage, access, and cost models. The
work is significant in systematically building the query processing methods out of a comprehensive
set of submethods. This results in a much larger set of query processing methods than Blasgen and
Eswaren present. Yao also investigates the use of links as auxiliary access paths.

2.4  A  conventional  query  optimizer  for  multifile  queries 

The methods that have been developed to handle two-relation queries in the restrict-join-project
class have been extended to handle queries that involve n relations, where n is greater than two. This
is the basis for the general query optimizer for IBM's System R experimental relational database
management system [Selinger79]. The optimization methods for the INGRES relational database
system can also be viewed in this framework for most queries [Youseffi78]. We illustrate the
functioning of n-relation conventional optimizers with the System R optimizer. The discussion omits
some aspects of optimization that are specific to System R, such as the possible requirement to
present results in a specified sequence or grouping.

Processing a query that involves N relations is viewed as processing a sequence of queries that
involve two relations. In this view, a two-relation subquery is processed to form a resulting composite
relation. This relation is processed with a third relation to form a new composite, and the sequence
continues until all relations in the original query have been brought in. In the actual processing, it is
not always necessary to form and store the complete composite relation before the next relation is
brought in. Instead, when a composite tuple is formed from a two-relation query, it can be joined
with tuples from a third relation, and so forth. Intermediate composite relations are stored only if a
sorting operation is required in connection with the next two-relation processing step.

The extension from two-relation queries to N-relation queries outlined above has been termed
iterative composition by Kim [Kim79]. The task of the general query optimizer based upon iterative
composition of subqueries is twofold. First, it is necessary for the optimizer to choose the order in
which the relations are to be brought in; that is, it must choose the sequence of two-relation
subqueries. Second, the optimizer must choose a method to carry out each subquery.

The sequence of subqueries is important in determining the overall cost, even though the size of
the result is the same regardless of the processing sequence. For a query that involves N relations,
there are N factorial permutations of the processing sequence. However, the method to bring in the
(K+1)th relation is independent of the way the first K relations are processed. The search of
sequences can therefore operate efficiently by finding good sequences for successively larger subsets
of the relations in the query. The System R optimizer uses another heuristic to reduce the number of
permutations it considers: a relation is considered as the next one to be brought in only if it is
involved in a join with one that has already been processed.
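
The search over successively larger subsets can be sketched as a small dynamic program. This is an illustration of the idea, not the System R implementation: the cost functions are supplied by the caller, the choice of processing method for each step is folded into step_cost, and all names are assumptions.

```python
from itertools import combinations

def best_join_order(relations, join_pairs, scan_cost, step_cost):
    """Return (cost, order) of the cheapest sequence found, or None if none is admissible.

    relations  : list of relation names
    join_pairs : set of frozenset({a, b}) for relations linked by a join term
    scan_cost  : function(relation) -> cost of scanning the first relation
    step_cost  : function(done, relation) -> cost of bringing relation in
                 after the relations in the frozenset done
    """
    # Best known (cost, order) for each subset of relations processed so far.
    best = {frozenset([r]): (scan_cost(r), [r]) for r in relations}
    for size in range(2, len(relations) + 1):
        for combo in combinations(relations, size):
            subset = frozenset(combo)
            for rel in combo:                 # rel is the last relation brought in
                rest = subset - {rel}
                if rest not in best:
                    continue
                # Heuristic: rel must join with a relation already processed.
                if not any(frozenset({rel, other}) in join_pairs for other in rest):
                    continue
                cost = best[rest][0] + step_cost(rest, rel)
                if subset not in best or cost < best[subset][0]:
                    best[subset] = (cost, best[rest][1] + [rel])
    return best.get(frozenset(relations))
```

Because only the cheapest sequence is kept for each subset, the search examines at most N times 2^N (subset, last relation) combinations rather than N factorial permutations, and the join heuristic removes subsets that could only be reached through a Cartesian product.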

The  System  R  optimizer  grows  a  processing  search  tree  by  iteration  on  the  number  of  relations 
involved  so  far.  First,  the  best  method  is  found  to  scan  each  relation.  Next,  the  best  way  is  found  to 
involve  the  first  relation  in  a  two-relation  query  with  a  second  relation.  This  continues  until  all 
relations are involved. Unpromising paths of the search tree are pruned on the basis of the heuristics
described  above  and  on  the  basis  of  estimated  processing  costs  for  the  partially  worked  out  queries. 

An  important  source  of  information  for  the  optimizer  is  the  estimated  selectivity  of  the  query 
restrictions,  the  only  place  in  the  optimizer  where  semantic  information  is  used.  The  selectivity  of  a 
restriction  on  a  relation  is  the  fraction  of  tuples  of  the  relation  that  satisfy  the  restriction.  Both  the 
cost  formulas  for  certain  scans  and  the  formulas  for  combined  methods  use  the  fraction  of  tuples  that 
meet  the  restrictions  imposed  by  the  query.  To  estimate  selectivity,  the  optimizer  uses  information 
about  the  range  of  values  for  attributes,  if  that  information  is  available.  It  makes  the  simple 
assumption that the values for any attribute are uniformly distributed within the legal range and that
the distribution of values is known with sufficiently fine granularity. This assumption enables
estimates to be made with limited statistics on database values. Youseffi [Youseffi78] has looked into
the issue of how additional statistics can improve the estimates, but the simple System R methods
appear to work fairly well [Astrahan80a]. In the absence of value range statistics, the System R
optimizer makes arbitrary although intuitively plausible estimates.
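
The uniform distribution assumption leads to a very simple estimate for a range restriction. The sketch below treats the attribute as continuous and is not the System R code; the function name and the clipping of the requested range to the legal range are assumptions made for the illustration.

```python
def range_selectivity(low, high, attr_min, attr_max):
    """Estimate the fraction of tuples with low <= value <= high, assuming values
    are uniformly distributed over the legal range [attr_min, attr_max]."""
    if attr_max <= attr_min:
        raise ValueError("legal range must have positive width")
    lo = max(low, attr_min)       # clip the requested range to the legal range
    hi = min(high, attr_max)
    if hi < lo:
        return 0.0                # the restriction lies wholly outside the legal range
    return (hi - lo) / (attr_max - attr_min)
```

For example, if Length is known to lie between 0 and 400 meters, a restriction to lengths between 200 and 400 has an estimated selectivity of 0.5, so about 500 tuples of a 1000-tuple relation would be expected to qualify.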

2.5 Generalizations about query processing

In  this  section,  we  review  some  general  conclusions  about  relational  query  processing  that  can  be 
drawn  from  the  kind  of  research  described  above.  As  we  shall  see,  these  general  conclusions  play  an 
important  role  in  the  design  of  an  effective  semantic  query  optimization  system. 

The net result of conventional query optimization research is an appreciation of how the
relationship between the constraints specified by a query and the data structures comprising the
database affects the cost of processing. In many cases, this knowledge is represented in the choice of
relevant system parameters and in the cost formulas based upon them. Occasionally, though, the
knowledge is expressed as general statements about the interaction of queries and structures.

A key factor in the cost of processing a query is the physical clustering of logically related items,
what Wiederhold [Wiederhold77] refers to as the binding of the data semantics. While this seems
intuitively obvious, the studies of Blasgen and Eswaren demonstrate the degree of its importance, and
they relate it to specific kinds of queries and specific structures in the storage model. The role of
clustered indexes is highlighted. As the System R experiments [Astrahan80a] confirm, the case of an
equality predicate on an indexed but nonclustered attribute is about the only case in which a
nonclustered index scan is preferred over a clustered one. In general, a query whose constraints
permit the exploitation of clustered access paths, whether indexes or links, can be answered much
more efficiently than a query whose constraints do not permit those paths to be exploited.

Conventional  query  optimization  pays  attention  to  avoiding  catastrophically  bad  processing 
methods.  The  classic  example  of  a  bad  processing  method  is  processing  a  join  as  a  Cartesian  product 
followed  by  a  restriction.  In  one  of  the  rare  glimpses  into  the  explicit  reasoning  of  experts  in  query 
processing, Youseffi and Wong [Youssefi79] discuss the formulation of processing strategies based on
this  consideration.  They  note  that,  intuitively,  the  processing  cost  for  a  one-variable  query  is  linear  in 
the  size  of  the  relation,  while  the  cost  for  a  two-variable  query  increases  faster  than  linearly  in  the 
sum  of  the  relation  sizes.  This  line  of  reasoning  suggests  to  them  that  it  is  nearly  always  advantageous 
to  restrict  the  individual  relations  prior  to  checking  the  join  condition,  that  is,  prior  to  accessing  and 
considering  the  relations  together.  An  exception  occurs  when  one  of  the  relations  is  physically 
clustered  with  respect  to  the  join  condition.  Other  factors  to  be  considered  are  the  sizes  of  the 
relations  and  whether  the  relations  are  in  the  target  list.  In  any  event,  it  is  generally  true  that  the 
stronger  the  restrictions  that  can  be  applied  to  the  individual  relations  prior  to  carrying  out  the  join, 
the  less  expensive  is  the  overall  process. 

This  discussion  is  indicative  of  the  body  of  expertise  about  query  processing  that  has  emerged  from 
research  on  query  optimization.  To  restate  the  main  idea,  the  cost  of  processing  depends  on  the 
relationship  between  the  constraints  specified  by  a  query  and  the  data  structures  implicated  by  the 
query. Specifically, with respect to the fundamental operations discussed in this chapter, Figure 2-3
indicates some representative generalizations:

In Chapter 4, we shall see how such generalizations are used to control the way semantic
knowledge is used in semantic query optimization. Before that, however, Chapter 3 discusses the
shortcomings of conventional query optimization and describes the new approach of semantic query
optimization.

• G1. A restriction on an attribute that is not indexed leads to an expensive scan.

• G2. A restriction (other than an equality predicate) on an indexed attribute where the
index is not a physically clustering index leads to an expensive scan.

•  G3.  A  restriction  on  a  physically  clustering  index  can  be  processed  efficiently. 

•  G4.  The  cost  of  joins  generally  dominates  the  overall  cost  of  processing. 

•  G5.  A  join  between  two  large  and  weakly  restricted  relations  is  very  expensive. 

•  G6.  The  cost  of  a  join  decreases  substantially  as  the  strength  of  restrictions  on  the  joined 
relations  increases,  except  on  a  relation  which  is  clustered  with  respect  to  the  join  term 
(and  is  therefore  likely  to  be  the  “inner"  relation  of  the  join  method). 


Figure 2-3: Generalizations about query processing

Chapter 3

Semantic Query Optimization

In this chapter, we present the formal basis for semantic query optimization in relational databases.

We begin in Section 3.1 by reviewing the limitations of conventional query optimization that
motivate the development of our new method. In Section 3.2 we look informally at the notion of the
semantic equivalence of two database queries that is at the heart of semantic query optimization.
Semantic equivalence is defined in terms of semantically meaningful states of the database. This in
turn is intimately bound up with the semantic integrity constraints associated with the database.
We formally define semantic integrity constraints for relational databases in Section 3.3. In Section
3.4, we show informally how one query can be transformed into a semantically equivalent one using a
semantic integrity constraint. Section 3.5 synthesizes the preceding sections into a formal definition
of relational database query transformations that preserve semantic equivalence. Finally, Section 3.6
discusses additional logical equivalence transformations that can be used in conjunction with
semantic equivalence transformations to reduce the cost of processing a query.

3.1  The  limits  of  conventional  query  optimization 

Conventional query optimization research has identified a set of problems, has produced useful
models of data storage and file operations, has yielded insights into the factors that influence the cost
of query processing, and has in general lent support to the belief that high level query languages can
be used with acceptable efficiency.

The difficulty with conventional query optimization remains the lack of correspondence between
the logical relationships referenced in a query and the physical relationships of the data that represent
them. One can view the manipulations of a conventional query optimizer as a hunt for opportunities,
for those parts of the query in which the logical structure corresponds well to the supporting physical
structure. For instance, the presence of indexes on the joining attributes for two files in a multifile
query is likely to make that join a candidate for processing before other joins. The logical/physical
correspondences are exploited to reduce as much as possible the size of the data structures that must
be handled without suitable physical support.

To maintain physical support for all logical relationships is not possible, however. The costs to
maintain that support as the database evolves are too great. In simplest terms, if a query involves a
large amount of data in logical relationships that are not well structured in the physical database, then
the query can't be processed efficiently. Examples of poor correspondence are easy to imagine. A
constraint on an unindexed attribute of a relation forces a sequential scan. A join between two files
for which suitable indexes or links do not exist forces a cross matching process which is almost always
very expensive.

Conventional  query  optimization  is  limited  to  treating  the  logical  restrictions  of  the  query  as  fixed. 
If  the  query  restrictions  cannot  be  processed  efficiently,  nothing  can  be  done. 

3.2  The  semantic  equivalence  of  queries 

We  now  begin  the  formal  description  of  semantic  query  optimization,  developed  as  a  response  to 
the  limitations  we  have  just  described.  The  key  idea  is  that  the  given  query  restrictions  are  not 
regarded  as  fixed,  but  as  perhaps  only  one  of  several  equivalent  ways  to  pose  the  same  question.  We 
said  in  Section  3.1  that  conventional  query  optimization  is  a  hunt  for  opportunities.  The  goal  of 
semantic  query  optimization  is  to  create  new  search  spaces  in  which  to  hunt  for  such  opportunities. 

The  heart  of  semantic  query  optimization  is  the  process  of  transforming  a  query  into  a  semantically 
equivalent  one.  Two  queries  are  considered  to  be  semantically  equivalent  if  they  result  in  the  same 
answer in any state of the database that conforms to the semantic integrity constraints (see Section
1.2.4).

Semantic equivalence is not the same as logical equivalence. Two queries are logically equivalent if
the qualifications of one can be transformed into the qualifications of the other by the application of
standard logical equivalences such as De Morgan's Laws. Another way to put this is that two queries
are logically equivalent if they produce the same answer in any database whatsoever in which the
queries are well-defined. For instance, the query “list the names of all employees who are not both
unmarried and over forty years old” is logically equivalent to the query “list the names of all
employees who are either married or are not over forty years old.”
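
The equivalence of the two employee queries can be checked mechanically. The following Python sketch is only an illustration (the function names are hypothetical and do not come from the thesis): it enumerates every combination of the two properties and confirms that the two qualifications agree without consulting any database contents or integrity rules.

```python
from itertools import product

def not_both_unmarried_and_over_forty(married: bool, over_forty: bool) -> bool:
    # "not both unmarried and over forty years old"
    return not ((not married) and over_forty)

def married_or_not_over_forty(married: bool, over_forty: bool) -> bool:
    # "either married or not over forty years old"
    return married or (not over_forty)

# Logical equivalence: the qualifications agree on every truth assignment,
# so they agree in any database in which both queries are well-defined.
logically_equivalent = all(
    not_both_unmarried_and_over_forty(m, o) == married_or_not_over_forty(m, o)
    for m, o in product([False, True], repeat=2)
)
print(logically_equivalent)
```

Because the check ranges over all truth assignments, it is an instance of De Morgan's Laws rather than a fact about any particular database state.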

Logically equivalent queries are obviously semantically equivalent, but semantically equivalent
queries need not be logically equivalent. That is, two semantically equivalent queries might yield
different answers when posed to the database in a state where some semantic integrity constraint is
violated.

For  example,  suppose  there  is  a  semantic  integrity  constraint  to  the  effect  that  the  company  has  no 
employee  under  the  age  of  eighteen.  If  the  database  conforms  to  this  condition,  then  the  query  “list 
the  names  of  all  employees  between  the  ages  of  fifteen  and  twenty”  is  semantically  equivalent  to  the 
query  “list  the  names  of  all  employees  between  the  ages  of  eighteen  and  twenty.”  The  answers  will  be 
the  same  because  the  enforcement  of  the  semantic  integrity  constraint  guarantees  that  there  is  no  item 
in  the  database  corresponding  to  an  employee  between  the  ages  of  fifteen  and  eighteen.  However,  if 
a  violation  of  the  constraint  is  permitted  and  data  is  entered  on  an  employee  whose  age  is  recorded  as 
sixteen years old, then the two queries will not produce the same answer.

Another way to look at the difference between logical equivalence and semantic equivalence is that
semantic equivalence is measured against a particular set of semantic integrity rules. For instance, if
the rule requiring employees to be at least eighteen is changed so that employees must be at least
seventeen instead, then the two queries just discussed are no longer semantically equivalent. The first
query may return some seventeen year olds but the second one cannot. By contrast, logical
equivalence is unaffected by changes in the semantic integrity constraints.
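
The dependence of semantic equivalence on the integrity rules can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the `Employee` record, the sample names, and the reading of "between" as inclusive at both ends.

```python
from typing import NamedTuple

class Employee(NamedTuple):
    name: str
    age: int

def answer(db, qualification):
    """Names of the employees in db that satisfy the qualification."""
    return {e.name for e in db if qualification(e)}

def q_15_to_20(e):
    return 15 <= e.age <= 20

def q_18_to_20(e):
    return 18 <= e.age <= 20

def min_age_rule(minimum):
    """Integrity rule: every employee is at least `minimum` years old."""
    return lambda db: all(e.age >= minimum for e in db)

permitted = [Employee("Ames", 18), Employee("Baker", 20), Employee("Cole", 35)]
violating = permitted + [Employee("Dunn", 16)]    # breaks the "at least 18" rule
seventeen = permitted + [Employee("Eads", 17)]    # legal only under "at least 17"

# Same answer in a state that conforms to the "at least 18" rule.
print(min_age_rule(18)(permitted), answer(permitted, q_15_to_20) == answer(permitted, q_18_to_20))
# Different answers once the rule is violated.
print(min_age_rule(18)(violating), answer(violating, q_15_to_20) == answer(violating, q_18_to_20))
# Different answers in a state permitted by the weaker "at least 17" rule.
print(min_age_rule(17)(seventeen), answer(seventeen, q_15_to_20) == answer(seventeen, q_18_to_20))
```

The two queries never change; only the set of permitted states does, which is why the equivalence holds under one rule and fails under the other.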

We  also  wish  to  distinguish  semantic  equivalence  from  coincidental  equivalence  in  a  particular 
state  of  the  database.  Semantically  equivalent  queries  must  produce  the  same  answer  in  all  permitted 
states  of  the  database.  A  simple  example  illustrates  what  we  mean  by  coincidental  equivalence. 
Suppose  the  company  happens  to  have  one  employee  named  “Fred  Smith”  and  that  he  happens  to 
be  the  only  employee  who  is  47  years  old.  Then  the  queries  "What  is  the  employee  number  of  each 
employee  named  Fred  Smith?"  and  “What  is  the  employee  number  of  each  employee  who  is  47  years 
old?"  give  the  same  answer.  However,  it  is  easy  to  imagine  a  situation  in  which  the  two  questions  do 
not  give  the  same  answer.  For  instance,  nothing  prevents  the  company  from  hiring  another  47  year 
old  employee  whose  name  is  not  “Fred  Smith”.  If  the  company  does  hire  another  47  year  old,  then 
the  two  questions  do  not  have  the  same  answer. 


3.3  Semantic  integrity  constraints 

The  foregoing  discussion  of  semantic  equivalence  underscores  the  point  that: 

The basis of semantic equivalence independent of logical equivalence and independent of
changes in state is the enforcement of the semantic integrity of the database.

The  notion  of  the  semantic  integrity  of  a  database  is  understood  with  respect  to  the  relationship  of 
the  database  to  some  real  world  application.  Every  allowable  state  of  the  database  is  supposed  to  be  a 
valid  "snapshot"  of  aspects  of  the  application.  If  the  database  contains  values  that  cannot  be  attained 
in  the  real  world  application,  then  there  is  said  to  be  a  violation  of  the  semantic  integrity  of  the 
database. 

We now formally develop the notion of semantic integrity constraints for relational databases. In
so doing, we are also preparing the groundwork for a formal discussion of relational database queries
and semantic equivalence transformations.

Our point of view is a standard one in research analyzing databases in terms of formal logic (see,
for instance, [Gallaire78]). The descriptors of relations and queries are just those of the relational
calculus that we discussed in Section 2Tb. A relational database is considered to be made up of two
parts: an extensional database (EDB), and an intensional database (IDB).

The EDB is the set of elementary assertions or tuples contained in the relations in any particular
state of the database. For instance, any of the tuples in our example in Section 2.1, such as

("Bralanta" "Braathan" "Tanker" 285 17 154)

is  part  of  the  EDB. 

The  IDB  is  a  set  of  general  laws  expressed  as  closed  well-formed  formulas  in  the  first-order 
predicate  calculus.  The  general  laws,  as  the  name  implies,  apply  more  broadly  than  the  elementary 
assertions.  An  example  of  a  general  law  that  applies  to  all  the  tuples  in  a  single  relation  is  a  rule  that 
all  ships  over  190  thousand  tons  deadweight  (size)  are  supertankers.  This  can  be  expressed  as 

$$\forall x/\mathrm{SHIPS}\ (x.\mathrm{Deadweight} > 190) \rightarrow (x.\mathrm{Shiptype} = \text{“supertanker”}).$$

As noted above, this general rule is expressed as a closed well-formed formula in a typed first-order
predicate calculus. The variable x ranges over tuples of the SHIPS relation. The expression
“x.Deadweight” signifies the value for the Deadweight attribute of the tuple to which x is bound.

Other  general  laws  may  involve  more  than  one  relation.  Suppose  there  is  a  CARGOES  relation 
that  includes  Ship  and  Quantity  attributes.  Then  we  can  express  the  rule  that  a  ship  cannot  carry  a 
greater  quantity  of  cargo  than  the  ship’s  capacity  as  follows: 

$$\forall x/\mathrm{SHIPS}\ \forall y/\mathrm{CARGOES}\ (x.\mathrm{Shipname} = y.\mathrm{Ship}) \rightarrow (y.\mathrm{Quantity} \leq x.\mathrm{Capacity})$$

Intuitively,  most  general  laws  involve  universal  quantification  over  relations.  However,  it  is  also 
possible  to  express  an  existential  law,  such  as  the  rule  that  there  is  at  least  one  supertanker: 

$$\exists x/\mathrm{SHIPS}\ (x.\mathrm{Shiptype} = \text{“supertanker”}).$$

Why divide the database into extensional and intensional parts? The reason is the following
essential relationship between EDB and IDB:

The elementary assertions or tuples of the EDB are considered to define an interpretation
of a first order theory whose proper (nonlogical) axioms are the general laws of the IDB.


From the perspective of semantic query optimization, the importance of general laws stems from
their use as integrity rules. In terms of a first-order theory and its interpretation, every operation on
the database, such as adding, deleting, or changing an elementary assertion, amounts to a change in
interpretation. In these terms, we have the following definition:

Semantic  integrity  is  enforced  if  and  only  if  the  only  changes  permitted  to  the  database  are 
those  that  leave  the  elementary  assertions  of  the  EDB  as  a  model  ( and  not  merely  an 
interpretation)  of  the  semantic  integrity  rules  of  the  IDB. 

In other words, the enforcement of semantic integrity prevents the database from entering a state in
which any of the closed well-formed formulas of the integrity rules evaluates to false.
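
The definition can be made concrete with a short Python sketch. It is a minimal illustration under stated assumptions, not the mechanism of any particular database system: the ships "Alpha" and "Beta", the dictionary representation of tuples, and the functions `is_model` and `insert_ship` are all hypothetical. The two rules are the supertanker laws given earlier in this section.

```python
from typing import Callable, Dict, List

Ship = Dict[str, object]
Rule = Callable[[List[Ship]], bool]

def supertanker_rule(ships: List[Ship]) -> bool:
    # Universal law: every ship over 190 thousand tons is a supertanker.
    return all(s["Deadweight"] <= 190 or s["Shiptype"] == "supertanker" for s in ships)

def at_least_one_supertanker(ships: List[Ship]) -> bool:
    # Existential law: there is at least one supertanker.
    return any(s["Shiptype"] == "supertanker" for s in ships)

IDB: List[Rule] = [supertanker_rule, at_least_one_supertanker]

def is_model(edb: List[Ship], idb: List[Rule]) -> bool:
    """The EDB is a model of the IDB when every integrity rule evaluates to true."""
    return all(rule(edb) for rule in idb)

def insert_ship(edb: List[Ship], ship: Ship, idb: List[Rule]) -> List[Ship]:
    """Permit the change only if the resulting state is still a model of the IDB."""
    candidate = edb + [ship]
    if not is_model(candidate, idb):
        raise ValueError("update rejected: it would violate a semantic integrity rule")
    return candidate

edb = [{"Shipname": "Alpha", "Deadweight": 285, "Shiptype": "supertanker"}]
edb = insert_ship(edb, {"Shipname": "Beta", "Deadweight": 120, "Shiptype": "tanker"}, IDB)
print(len(edb), is_model(edb, IDB))
```

An attempt to insert a ship with a deadweight of 250 and a shiptype of "tanker" raises `ValueError`, so the database never enters a state in which a rule is false.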

3.4 Query transformations that preserve semantic equivalence

Our  motivation  in  investigating  semantic  integrity  constraints  is  to  see  how  to  transform  a  query 
into  a  semantically  equivalent  query.  The  significance  of  integrity  rules  for  this  purpose  becomes 
apparent  if  we  consider  the  notion  of  satisfiability.  Specifically,  suppose  we  drop  the  universal 
quantifier  from  the  first  general  rule  above.  The  result  is  an  open  formula  in  which  the  previously 
quantified  variable  now  appears  free.  The  open  formula  can  be  put  in  the  form  of  a  query,  similar  to 
the form that appears in [Pirotte78]:

$$Q_c:\ \{x/\mathrm{SHIPS} \mid (x.\mathrm{Deadweight} > 190) \rightarrow (x.\mathrm{Shiptype} = \text{“supertanker”})\}.$$

The answer to this query is a set of tuples from the SHIPS relation, namely the set of tuples that
corresponds to ships whose deadweight is not greater than 190 or which are supertankers. For
convenience, we omit any indication of which attributes of x should be returned.

The items in the answer set for this query are those tuples in the SHIPS relation which, when
substituted for x in the formula, make the formula true.

The significant observation is that by enforcing the original integrity constraint, we require that
every tuple in SHIPS make the formula true. Hence, the open formula is satisfied by the entire
SHIPS relation. That is to say, according to the rule, every ship either has a deadweight of at most
190 or is a supertanker.

Consider any other query that requests the set of tuples from SHIPS that satisfy some qualification
$Q$. Let T be the set of qualifying tuples. The set T is clearly a subset of the set of all tuples in SHIPS.
But all tuples in SHIPS satisfy the integrity constraint qualification $Q_c$, so in particular, the tuples in T
satisfy it also. That is, no tuple of T satisfies the qualification $Q$ but does not satisfy the qualification
$Q_c$. Therefore, we can replace qualification $Q$ by the conjunction of $Q$ and $Q_c$, and the answer set T
remains the same. The query with this new qualification is semantically equivalent to the original
query with qualification $Q$.

For example, suppose we start with a query that requests all ships with a draft greater than 20
meters. We express this as:

$$\{x/\mathrm{SHIPS} \mid (x.\mathrm{Draft} > 20)\}.$$

We can drop the universal quantifier from the integrity constraint and conjoin the resulting open
formula with the query qualification to obtain the following semantically equivalent query:

$$\{x/\mathrm{SHIPS} \mid (x.\mathrm{Draft} > 20) \wedge ((x.\mathrm{Deadweight} > 190) \rightarrow (x.\mathrm{Shiptype} = \text{“supertanker”}))\}$$

This new query doesn't make much sense as it stands. It asks for those ships whose draft exceeds 20
meters and which, if they have a deadweight over 190, also have a shiptype of “supertanker”.
Nevertheless, this query yields the same answer as the original one. Now suppose we had started
with a query that requests all ships with a deadweight of over 190 thousand tons:

$$\{x/\mathrm{SHIPS} \mid (x.\mathrm{Deadweight} > 190)\}.$$

We apply the same transformation to this query to obtain:

$$\{x/\mathrm{SHIPS} \mid (x.\mathrm{Deadweight} > 190) \wedge ((x.\mathrm{Deadweight} > 190) \rightarrow (x.\mathrm{Shiptype} = \text{“supertanker”}))\}$$

The  transformation  based  upon  integrity  semantics  has  been  completed,  but  we  can  now  use  the 
logical  axioms  of  first-order  logic  to  transform  this  query  further.  In  particular,  we  use  the 
equivalence  expressed  in  the  axiom  schema: 

$$(A \wedge (A \rightarrow B)) \equiv (A \wedge B)$$

for  any  terms  A  and  B  to  transform  the  query  into  the  simpler,  equivalent  form: 

$$\{x/\mathrm{SHIPS} \mid (x.\mathrm{Deadweight} > 190) \wedge (x.\mathrm{Shiptype} = \text{“supertanker”})\}.$$

This is indeed an interesting result. We started with a constraint on the deadweight of ships, and
found  that  we  could  add  a  constraint  on  their  shiptype.  If  the  Shiptype  attribute  is  indexed,  the  new 
query  may  be  much  less  expensive  to  process.  The  transformation  corresponds  to  our  intuition  in  this 
case,  as  it  should.  The  integrity  constraint  we  used  says  that  all  ships  with  deadweight  over  190 
thousand  tons  are  supertankers.  The  end  result  looks  like  a  simple  application  of  modus  ponens,  but 
it  is  more  than  this;  it  is  a  transformation  that  depends  on  properties  of  the  database  when  viewed  as 
a  model  of  the  integrity  constraints. 
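
The deadweight example can be traced end to end in a few lines of Python. This is a sketch under assumptions: the four ships, their attribute values, and the dictionary-based index are invented for illustration and stand in for a SHIPS relation in a permitted state.

```python
from collections import defaultdict

ships = [
    {"Shipname": "Alpha", "Deadweight": 285, "Draft": 22, "Shiptype": "supertanker"},
    {"Shipname": "Beta", "Deadweight": 120, "Draft": 14, "Shiptype": "tanker"},
    {"Shipname": "Gamma", "Deadweight": 60, "Draft": 9, "Shiptype": "freighter"},
    {"Shipname": "Delta", "Deadweight": 210, "Draft": 21, "Shiptype": "supertanker"},
]

def constraint(x):
    # Matrix of the integrity rule with its universal quantifier dropped.
    return not (x["Deadweight"] > 190) or x["Shiptype"] == "supertanker"

def original(x):
    return x["Deadweight"] > 190

def merged(x):
    # Original qualification conjoined with the constraint matrix.
    return original(x) and constraint(x)

def simplified(x):
    # After applying (A and (A -> B)) == (A and B).
    return x["Deadweight"] > 190 and x["Shiptype"] == "supertanker"

def answer(relation, qualification):
    return [x["Shipname"] for x in relation if qualification(x)]

# The state is permitted: every tuple satisfies the constraint matrix.
assert all(constraint(x) for x in ships)

# A hypothetical index on Shiptype lets the simplified query skip most tuples.
shiptype_index = defaultdict(list)
for x in ships:
    shiptype_index[x["Shiptype"]].append(x)
candidates = shiptype_index["supertanker"]
indexed_answer = [x["Shipname"] for x in candidates if x["Deadweight"] > 190]

print(answer(ships, original), answer(ships, merged), answer(ships, simplified), indexed_answer)
```

All four evaluations return the same ships, but the indexed evaluation examines only the supertanker tuples instead of scanning the whole relation, which is the saving the transformation is meant to expose.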

3.5  Formal  definition  of  semantic  equivalence  transformations 

We now develop a general, formal definition of the type of transformation illustrated by the
foregoing example. The idea behind the definition is also seen in the example. The transformation
should permit us to combine an integrity constraint and a query in such a way that the meaning of the
query is not changed and so that terms can be further combined by the application of logical
equivalences. Our discussion has three parts: transformation of a well-formed formula (wff) of the
typed predicate calculus by means of merging with a second wff; conditions under which the new wff
is semantically equivalent to the original wff; and the application of this type of transformation
to queries and semantic integrity constraints in the relational calculus.


3.5.1  Merging  of  well-formed  formulas 

Consider two wffs, X and Y, of a typed, first-order predicate calculus. Suppose that X has the free
variables $(x_1, x_2, \ldots, x_n)$ and that Y has the free variables $(y_1, y_2, \ldots, y_m)$. Each variable $x_i$ and $y_j$ is
typed, that is, it ranges over a specified domain. We can write X and Y in terms of predicates P and Q
as follows:

$$X = P(x_1, x_2, \ldots, x_n)$$

$$Y = Q(y_1, y_2, \ldots, y_m).$$

Under these circumstances, we have the following condition for merging X and Y:

Formula Y can be merged into formula X if and only if the variables $(y_1, \ldots, y_m)$ can be put
into one-to-one correspondence with a subset of the variables $(x_1, \ldots, x_n)$ so that corresponding
variables range over the same domain. If this condition holds, then formulas X and Y are
said to be merge-compatible.

Let $x_i'$ be the variable in X that corresponds to variable $y_i$ in Y ($x_i'$ is not necessarily the same
variable as $x_i$). Then Y can be rewritten as:

$$Y = Q(x_1', x_2', \ldots, x_m').$$

We now take the conjunction Z of X and Y:

$$Z = P(x_1, x_2, \ldots, x_n) \wedge Q(x_1', x_2', \ldots, x_m').$$

But the variables $(x_1', x_2', \ldots, x_m')$ are a (possibly rearranged) sublist of the variables $(x_1, x_2, \ldots, x_n)$, so
we can write Z just in terms of the latter variables:

$$Z = P_z(x_1, x_2, \ldots, x_n).$$

We say that the formula Z is the transformation of the formula X when merged with the formula
Y.


3.5.2  Semantic  equivalence  of  transformed  formulas 

Let us assume that each variable $x_i$ ranges over some domain of values $D_i$. Let us further assume
that there is some set I of permitted interpretations of the variables $(x_1, \ldots, x_n)$, where an
interpretation is an assignment of a value from domain $D_i$ to the corresponding variable $x_i$, for all i
from 1 through n. The set I is a subset of the Cartesian product of the domains, denoted by
$D = D_1 \times D_2 \times \cdots \times D_n$. Under these assumptions, we have the following definition:


Two well-formed formulas F1 and F2 over free variables $(x_1, \ldots, x_n)$ are semantically
equivalent with respect to the permitted interpretations if and only if F1 and F2 have the
same truth value in every permitted interpretation.

Note in particular that it is not necessary for F1 and F2 to have the same truth value for possible
interpretations in D that are not in the subset I of permitted interpretations (as they would have to be
if they were logically equivalent). We expect that D is reduced to I by means of semantic integrity
constraints.

The original wff X and the transformed wff Z of Section 3.5.1 range over the same set of variables.
Under what conditions are they semantically equivalent according to the definition just given?

Formula Z is the conjunction of formulas X and Y. It is clear, therefore, that Z is semantically
equivalent to X if and only if the formula Y is true under all permitted interpretations of the
variables. Now, Y is defined only in terms of the variables $(y_1, \ldots, y_m)$, a subset of the variables
$(x_1, \ldots, x_n)$. Hence, every interpretation of $(x_1, \ldots, x_n)$ includes an interpretation of $(y_1, \ldots, y_m)$. Therefore,

The conjunction of two merge-compatible formulas X and Y is semantically equivalent to
formula X if and only if formula Y is true in all permitted interpretations of its variables
$(y_1, \ldots, y_m)$. For our purposes, we call this the validity requirement.
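
The role of the permitted interpretations can be seen by brute force on small finite domains. In the Python sketch below, the domains, the names `D` and `I`, and the choice of the supertanker rule as formula Y are assumptions made for illustration; the set I is obtained by keeping exactly those interpretations in which Y is true.

```python
from itertools import product

def semantically_equivalent(f1, f2, interpretations):
    """True when f1 and f2 have the same truth value in every given interpretation."""
    return all(f1(*i) == f2(*i) for i in interpretations)

DEADWEIGHTS = [50, 150, 250]
SHIPTYPES = ["tanker", "supertanker"]
D = list(product(DEADWEIGHTS, SHIPTYPES))    # all possible interpretations

def Y(deadweight, shiptype):
    # Merging formula: the supertanker rule without its quantifier.
    return deadweight <= 190 or shiptype == "supertanker"

I = [i for i in D if Y(*i)]                  # permitted interpretations

def X(deadweight, shiptype):
    return deadweight > 190

def Z(deadweight, shiptype):
    # Transformation of X when merged with Y.
    return X(deadweight, shiptype) and Y(deadweight, shiptype)

equivalent_on_I = semantically_equivalent(X, Z, I)
equivalent_on_D = semantically_equivalent(X, Z, D)
print(equivalent_on_I, equivalent_on_D)
```

X and Z agree on every interpretation in I but differ on the excluded interpretation (250, "tanker"), so they are semantically equivalent with respect to I without being logically equivalent.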

There is one more important point to consider. Suppose X is actually the quantifier-free matrix of
a quantified well-formed formula $F_x$. Some or all of X's variables will then be bound and not free.
Call the bound variables $b_i$ and the free variables $f_j$. Then $F_x$ can be written as

$$F_x:\ (Q_1 b_1)(Q_2 b_2) \ldots (Q_h b_h)\, X,$$

or

$$F_x:\ (Q_1 b_1)(Q_2 b_2) \ldots (Q_h b_h)\, P(b_1, \ldots, b_h, f_1, \ldots, f_g)$$

where each $(Q_i b_i)$ is either of the two quantifier expressions $\forall b_i$ and $\exists b_i$. It is clear, however, that the
quantified well-formed formula $F_z$ formed by substituting Z for X in $F_x$ has the same truth value as
$F_x$ for all permitted assignments of values to the free variables $f_1$ through $f_g$.

3.5.3  Transformation  of  a  query  using  a  semantic  integrity  constraint 

We now connect the discussion with our central interest in queries and semantic knowledge. Here,
the role of the formula to be transformed, $F_x$, is assumed by a database query. The role of the
merging formula Y is played by a semantic integrity constraint. The resulting formula $F_z$ is the new
semantically equivalent query.

We draw upon the view of a database in terms of relational calculus, described in Section 3.3. Let
$P_q(b_1, \ldots, b_m, f_1, \ldots, f_n)$ be a well-formed formula of the tuple relational calculus [Pirotte78] with free
variables $b_1$ through $b_m$ and $f_1$ through $f_n$. Every variable is understood to range over the tuples of a
single relation. As before, let $(Q_i b_i)$ be either of the two quantifier expressions $\forall b_i$ and $\exists b_i$. Then any
query can be expressed in the form:

$$Q:\ (Q_1 b_1)(Q_2 b_2) \ldots (Q_m b_m)\, P_q(b_1, \ldots, b_m, f_1, \ldots, f_n).$$

Considering the query Q as a whole, variables $b_1$ through $b_m$ are bound, and variables $f_1$ through
$f_n$ are free. There are two kinds of queries to consider. In a closed query, there are no free variables
(n = 0). The answer to a closed query is a yes/no answer, depending upon whether or not Q is true
with respect to the current interpretation, that is, the current contents of the extensional database
(EDB). If there are free variables (n > 0), then the query is an open query. The answer to an open
query is the set of assignments to the free variables $f_1$ through $f_n$ that make Q true in the current
interpretation. Because variables range over the tuples of relations, the answer to an open query is a
set of n-tuples of relation tuples. An open query need not have any quantifier expressions; it must
have free variables. In either case, provision must also be made to extract tuple attributes for
comparison or retrieval purposes.
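
The distinction between closed and open queries can be sketched in Python, with each relation held as a list of dictionaries. The relations, attribute values, and query choices below are hypothetical and serve only to show the shape of each kind of answer.

```python
ships = [
    {"Shipname": "Alpha", "Shiptype": "supertanker", "Draft": 25},
    {"Shipname": "Beta", "Shiptype": "freighter", "Draft": 12},
]
cargoes = [
    {"Ship": "Alpha", "Quantity": 150},
    {"Ship": "Beta", "Quantity": 40},
]

# Closed query (n = 0): one bound variable, no free variables, yes/no answer.
closed_answer = any(x["Shiptype"] == "supertanker" for x in ships)

# Open query with one free variable and no quantifier expressions.
open_one = [x for x in ships if x["Draft"] > 20]

# Open query with two free variables: the answer is a set of pairs of tuples.
open_two = [
    (x, y)
    for x in ships
    for y in cargoes
    if x["Shipname"] == y["Ship"] and y["Quantity"] > 100
]

# Open query with one free variable x and one existentially bound variable y.
open_with_bound = [
    x
    for x in ships
    if any(y["Ship"] == x["Shipname"] and y["Quantity"] > 100 for y in cargoes)
]

print(closed_answer, len(open_one), len(open_two), len(open_with_bound))
```

The last two queries use the same matrix; binding y with an existential quantifier changes the answer from pairs of tuples to single SHIPS tuples.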

As noted in Section 3.3, a semantic integrity constraint can be represented as a closed well-formed
formula of the relational calculus. Hence, we can express a constraint as:

$$C:\ (Q_1 c_1)(Q_2 c_2) \ldots (Q_k c_k)\, P_c(c_1, \ldots, c_k)$$

where there are no free variables in C taken as a whole. Constraint C has the very important property
that it evaluates to “true” in all permitted states of the database; indeed, that is the definition of
semantic integrity enforcement.

As we stated above, we want to have query Q and constraint C play the roles of formulas $F_x$ and Y
of Section 3.5.1, respectively. It is evident that a query Q is very much like the kind of formula $F_x$
given above. The only additional specification is that the variables range over the tuples of database
relations. However, the correspondence between a semantic integrity constraint C and the formula of
type Y is not so immediate. We must confront the fact that constraint C has no free variables as it
now stands, so it can't be merge-compatible with any other formula.

We remedy the absence of free variables in C in such a way that we insure the validity requirement
stated in Section 3.5.2. Namely, we allow any universal quantifier in C's prefix to be dropped. If the
quantifier $\forall c_i$ is dropped, then C can now be expressed as the formula $P'(c_i)$, a formula with no prefix
and the single free variable $c_i$. The resulting formula must be true in all permitted interpretations
(assignments of a value to variable $c_i$). This is because the original universally quantified constraint
says precisely that the formula is true for all values of variable $c_i$.

A universal quantifier can be dropped wherever it appears in the prefix, even if it appears within
the scope of an existential quantifier. This can be seen from the logical theorem

$$(\mathrm{PREF}_1)(\exists x)(\forall y)(\mathrm{PREF}_2)\, P(z_1, x, y, z_2) \rightarrow (\mathrm{PREF}_1)(\forall y)(\exists x)(\mathrm{PREF}_2)\, P(z_1, x, y, z_2)$$

where $(\mathrm{PREF}_1)$ and $(\mathrm{PREF}_2)$ stand for portions of the prefix. This means that a universal quantifier
can be “moved left” outside the scope of an enclosing existential quantifier, hence outside the scope
of any existential quantifier.
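
The core of this theorem, with the surrounding prefix portions omitted, can be checked exhaustively on a small domain. The Python sketch below is a finite sanity check under an assumed two-element domain, not a proof: it enumerates all sixteen binary relations P(x, y) and tests the implication in both directions.

```python
from itertools import product

DOMAIN = (0, 1)
PAIRS = list(product(DOMAIN, repeat=2))

def exists_forall(P):
    # (exists x)(forall y) P(x, y)
    return any(all(P[(x, y)] for y in DOMAIN) for x in DOMAIN)

def forall_exists(P):
    # (forall y)(exists x) P(x, y)
    return all(any(P[(x, y)] for x in DOMAIN) for y in DOMAIN)

# Every binary relation on the domain, represented as a truth table.
relations = [dict(zip(PAIRS, bits)) for bits in product([False, True], repeat=len(PAIRS))]

theorem_holds = all((not exists_forall(P)) or forall_exists(P) for P in relations)
converse_holds = all((not forall_exists(P)) or exists_forall(P) for P in relations)
print(theorem_holds, converse_holds)
```

The implication holds for every relation, while the converse fails (for instance when P is equality), which is why the universal quantifier may be moved left past an existential one but not the reverse.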

We do not permit an existential quantifier to be dropped. To see why we impose this restriction,
consider what it would mean to do so. The variable bound by the quantifier would now be free.
What tuples in the range of the variable would satisfy the resulting formula? We have no way to tell.
All we know is that at least one tuple does satisfy the formula, but we cannot assert that the formula is
true for every such assignment.

It  must  be  noted  of  course  that  the  requirement  of  merge  compatibility  means  that  we  can  only 
create  free  variables  in  the  constraint  for  which  there  is  a  corresponding  free  variable  in  the  matrix  of 
the  query. 

We now have a direct parallel to the process set forth in Sections 3.5.1 and 3.5.2. We summarize
the process for transforming a query into a semantically equivalent query as follows:

Let Q be a query expressed in the tuple relational calculus:

$$Q:\ (Q_1 b_1)(Q_2 b_2) \ldots (Q_m b_m)\, P_q(b_1, \ldots, b_m, f_1, \ldots, f_n),$$

where every variable is understood to range over the tuples of a single relation and $(Q_i b_i)$ is either of
the two quantifier expressions $\forall b_i$ and $\exists b_i$.

Let C be a semantic integrity constraint represented as a closed well-formed formula of the tuple
relational calculus:

$$C:\ (Q_1 c_1)(Q_2 c_2) \ldots (Q_k c_k)\, P_c(c_1, \ldots, c_k)$$

where C has the property that it evaluates to true in all permitted states of the database. Let
$(c_a, c_b, \ldots, c_j)$ be a subset of the universally quantified variables of constraint C, and let
$P'(c_a, c_b, \ldots, c_j)$ be the well-formed formula produced by dropping the corresponding universal
quantifiers. Then, if and only if the variables $(c_a, c_b, \ldots, c_j)$ can be put into one-to-one
correspondence with a subset of the variables $(b_1, \ldots, b_m, f_1, \ldots, f_n)$ so that corresponding variables
range over the same relation, it follows that the query Q' given by:

$$Q':\ (Q_1 b_1)(Q_2 b_2) \ldots (Q_m b_m)\, [P_q(b_1, \ldots, b_m, f_1, \ldots, f_n) \wedge P'(c_a, c_b, \ldots, c_j)]$$

is semantically equivalent to query Q; that is, Q' gives the same answer as Q in every permitted state
of the database. For convenience, the newly transformed query can be written as:

$$Q':\ (Q_1 b_1)(Q_2 b_2) \ldots (Q_m b_m)\, P_q'(b_1, \ldots, b_m, f_1, \ldots, f_n)$$

where $P_q'$ is the conjunction of $P_q$ and $P'$.
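
The procedure just summarized can be sketched symbolically in Python. This is a minimal illustration under several assumptions, and it is not the representation used by any implemented system: formulas are plain strings, variables are single identifiers followed by a dot and an attribute name, and the function `merge_constraint` is hypothetical. It pairs each freed constraint variable with the first unused query variable that ranges over the same relation, so it finds only one of the possible correspondences.

```python
import re
from typing import Dict, Optional

def merge_constraint(query_vars: Dict[str, str], query_matrix: str,
                     freed_vars: Dict[str, str], constraint_matrix: str) -> Optional[str]:
    """Conjoin a constraint matrix with a query matrix, or return None.

    query_vars maps each query variable to the relation it ranges over.
    freed_vars maps each constraint variable whose universal quantifier
    was dropped to the relation it ranges over.
    """
    if not freed_vars:
        return None                       # a closed constraint cannot be merged
    mapping, used = {}, set()
    for c_var, c_rel in freed_vars.items():
        match = next((q for q, q_rel in query_vars.items()
                      if q_rel == c_rel and q not in used), None)
        if match is None:
            return None                   # not merge-compatible
        mapping[c_var] = match
        used.add(match)
    # Rename all constraint variables in one pass to avoid chained renames.
    pattern = r"\b(" + "|".join(re.escape(v) for v in mapping) + r")\."
    renamed = re.sub(pattern, lambda m: mapping[m.group(1)] + ".", constraint_matrix)
    return f"({query_matrix}) AND ({renamed})"

merged = merge_constraint(
    {"s": "SHIPS"}, "s.Deadweight > 190",
    {"x": "SHIPS"}, "(x.Deadweight > 190) -> (x.Shiptype = 'supertanker')",
)
print(merged)

incompatible = merge_constraint(
    {"s": "SHIPS"}, "s.Deadweight > 190",
    {"y": "CARGOES"}, "y.Quantity > 0",
)
print(incompatible)
```

The second call returns `None` because no query variable ranges over CARGOES, which mirrors the requirement that free variables may be created in the constraint only where the query has a corresponding variable.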


3.6  Logical  transformations  in  semantic  query  optimization 

A semantically equivalent query formed according to the preceding definitions may well be more
expensive to process than the original query. After all, the new query apparently involves more terms
than the original. However, various improvements in efficiency may arise by a further
transformation or simplification of the new expression, based upon the replacement of terms by
terms that are logically equivalent. The effect is that terms in the new qualification expression are
subject to cancellation or combination. Simplifications can be based upon such domain-independent
properties as transitivity of numerical comparators, along the lines suggested by Youseffi
[Youseffi78].

Of great importance are simplifications that involve semantic integrity constraints in the form of
implications. To see this, consider a constraint of the form

$$\forall x\ P(x) \rightarrow Q(x)$$

where the variable x ranges over some relation R. Suppose the matrix of some query Q contains the
term P(z) where the variable z ranges over the same relation as the variable x in the constraint.
According to the procedures outlined in the preceding sections, we can transform Q into a
semantically equivalent query Q' whose matrix contains the conjunction:

$$P(z) \wedge (P(z) \rightarrow Q(z)).$$

However, by the logical equivalence

$$(A \wedge (A \rightarrow B)) \equiv (A \wedge B)$$

we can replace this conjunction by the simpler conjunction

$$P(z) \wedge Q(z).$$

The net effect is as if the original query condition P(z) were used to infer the new condition Q(z)
by means of the semantic integrity constraint. Similarly, if the original query contains the term $\neg Q(y)$
where the variable y ranges over relation R, then by the equivalence

$$(\neg B \wedge (A \rightarrow B)) \equiv (\neg B \wedge \neg A)$$

this term can be replaced by the conjunction $\neg Q(y) \wedge \neg P(y)$. Indeed, if the original query actually
contains the conjunction $P(z) \wedge Q(z)$, z ranging over relation R, then by using the logical equivalence

$$(A \wedge B \wedge (A \rightarrow B)) \equiv (A \wedge (A \rightarrow B))$$

and the fact that the constraint $P(z) \rightarrow Q(z)$ is true in every permitted state, we can replace this
conjunction by the simple condition P(z). In other words, the condition Q(z) has
been shown to be derivable from P(z), hence it is superfluous and may be dropped from the query.
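
The three simplifications can be verified by truth table and then applied to a conjunction of query terms. The Python sketch below is illustrative only: terms are represented as strings, the `"not "` prefix convention and the function `simplify` are assumptions, and the rules are applied unconditionally, whereas an optimizer would decide on cost grounds whether adding or dropping a term is worthwhile.

```python
from itertools import product

def implies(a, b):
    return (not a) or b

def equivalent(f, g):
    """Truth-table check of two formulas over the propositions A and B."""
    return all(f(a, b) == g(a, b) for a, b in product([False, True], repeat=2))

# (A and (A -> B)) == (A and B)
add_consequent_ok = equivalent(lambda a, b: a and implies(a, b), lambda a, b: a and b)
# (not B and (A -> B)) == (not B and not A)
add_negated_antecedent_ok = equivalent(lambda a, b: (not b) and implies(a, b),
                                       lambda a, b: (not b) and (not a))
# (A and B and (A -> B)) == (A and (A -> B)); the constraint itself is then
# dropped because it is true in every permitted state.
drop_consequent_ok = equivalent(lambda a, b: a and b and implies(a, b),
                                lambda a, b: a and implies(a, b))

def simplify(terms, antecedent, consequent):
    """Apply the constraint antecedent -> consequent to a conjunction of terms."""
    result = set(terms)
    if antecedent in result and consequent in result:
        result.discard(consequent)            # consequent is superfluous
    elif antecedent in result:
        result.add(consequent)                # infer the consequent
    elif "not " + consequent in result:
        result.add("not " + antecedent)       # infer the negated antecedent
    return result

print(add_consequent_ok, add_negated_antecedent_ok, drop_consequent_ok)
print(sorted(simplify({"P(z)"}, "P(z)", "Q(z)")))
print(sorted(simplify({"not Q(z)"}, "P(z)", "Q(z)")))
print(sorted(simplify({"P(z)", "Q(z)"}, "P(z)", "Q(z)")))
```

Note that the first and third rules are inverses of one another, so a practical system needs some control strategy to choose which direction is cheaper for a given query.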

This  concludes  our  general  discussion  of  the  formal  basis  for  semantic  query  optimization.  In  the 
next  chapter,  we  describe  the  QUIST  system,  in  which  these  ideas  have  been  implemented  and 
tested. 

Chapter 4
The  QUIST  system 


In Chapter 3, we presented the formal basis for the transformation of one relational database query
into another semantically equivalent query. This semantic equivalence transformation is at the heart
of semantic query optimization. In this chapter, we take up the issue of creating an effective semantic
query optimization system, and we describe the operation of QUIST, an implemented semantic
query optimization system.

In Section 4.1, we discuss the factors that influence the effectiveness of a semantic query
optimization system, particularly the choice of what semantic knowledge should ever be considered
for semantic query optimization, and the way that structural and processing knowledge is used to
control the semantic transformation of queries. We begin the description of QUIST in Section 4.2,
noting the class of queries it handles and the types of semantic rules it uses. In Section 4.3 we present
an overview of system operation. We indicate that QUIST operates in a plan-generate-test mode in
which the problem of query optimization is addressed at different levels of abstraction. Finally, in
Section 4.4 we discuss the actions of the system in great detail by means of an example. We show
how the generalizations about processing queries to relational databases discussed in Chapter 2 are
incorporated in specific heuristics. We show specifically how the heuristics are used to control which
knowledge base rules are used for query transformations, and we relate the heuristics in general to
particular types of transformations of relational database queries.


4.1  The design of an effective semantic query optimization system

A query optimization strategy based upon semantic equivalence transformations presents both
opportunities and dangers. The opportunities lie in the possibility of eliminating unneeded
operations, or replacing or modifying operations with more efficient ones. The dangers arise from
what may be a large store of semantic integrity constraints. Any query might possibly be transformed
by any combination of those constraints. If not controlled in some way, the process of generating
transformations of the query can be very expensive.

There are two ways to bring the process under control: by restricting what semantic integrity
constraints will ever be considered for query transformation, and by using knowledge of database
structure and processing to guide the transformation of any particular query.


4.1.1  Choosing semantic knowledge

The kinds of knowledge that are most useful for semantic query optimization depend primarily on
two factors: the kinds of queries that are to be handled, and the physical organization of the data.

The most common kinds of queries involve access between entities and their attributes. A typical
query may be “What is the length of the Totor?” in which an entity’s name (or identifier) is given and
some attributes are sought. The direction of access is reversed in a query like “What are the names of
the French ships over 300 feet long?” in which constraints are specified on the values of attributes
and the retrieval of entity identifiers is sought. Both directions of access are combined in a query
such as “List the draft of French ships” where a set of entities is specified by means of constraints on
one set of attributes, and retrieval of another set of attributes is sought. When the query involves
relationships and not just objects, access between entities and attributes is still crucial. For instance,
in the query “Which Italian ships are commanded by admirals?” the set of ship captains who are
considered as commanders is confined to those whose rank (an attribute) meets a specified constraint.

The importance and frequency of these queries is reflected in the physical organization of
databases, the second major factor that influences what semantics should enter into semantic query
optimization. A prime objective of semantic query optimization is to produce useful constraints. As
pointed out in Chapter 2, the physical structure of a relational database is typically organized into
records and fields that correspond to entities and attributes. In anticipation of queries with
constraints on attributes, indexes are stored that contain pointers to physical locations of records
(entities) with particular values in certain fields (attributes). Constraints on indexed attributes
obviously are useful, as are constraints on attributes of entities that have links to other entities.

In consideration of both the common kinds of queries and the typical physical organization of
databases, it is evident that constraints on the attributes of entities are of utmost importance. The
kinds of semantic rules that are most useful are rules that relate constraints on attributes expressed in
queries with constraints that are useful in the sense just described.

This observation leads to the view of semantic query optimization as a movement of constraints
among different parts of the database. One kind of semantic rule that directly supports the
movement of constraints is what Kent [Kent78] calls general restrictions on relationships. These are
constraints on the participants in relationships that are more specific than simply designating their
entity type. They relate properties or attributes of one participant with properties or attributes of
another. One such kind states

$C_1 \; \theta \; C_2$

where $C_1$ and $C_2$ are simple restrictions on attributes and $\theta$ is a Boolean comparator such as less-than
or greater-than. For example, there may be a relationship between a consignment of cargo and the
insurance policy that covers it, to the effect that the amount of the policy does not exceed the value of
the consignment. In this case, there is a relationship between the amount attribute of the policy and
the value attribute of the consignment. Another kind of rule that restricts relationships states

$C_1 \rightarrow C_2$

for constraints $C_1$ and $C_2$. For instance, we may know that only leasing companies own ships with a
deadweight  (size)  over  some  amount.  That  is,  given  a  certain  constraint  on  a  ship’s  deadweight, 
another  constraint  can  be  inferred  on  the  type  of  business  of  the  company  that  owns  the  ship. 


4.1.2  Controlling  transformations  with  structural  and  processing  knowledge 

There  is  no  guarantee  that  any  semantic  equivalence  transformation  leads  to  a  lower  cost  query. 
Indeed,  a  competent  database  administrator  chooses  database  file  structures  that  support  efficient 
access  to  frequently  referenced  data.  Thus,  assuming  that  a  conventional  query  optimizer  is  used,  it  is 
reasonable  to  expect  that  many  queries  can  be  answered  efficiently  in  the  form  in  which  they  are 
posed. 

Therefore,  an  effective  semantic  query  optimization  system  must  determine  whether  to  seek  cost 
reductions  via  semantic  transformations.  If  it  does,  it  must  confine  its  efforts  as  much  as  possible  to 
the  transformations  that  are  most  likely  to  result  in  lower  cost  queries.  It  should  not  undertake  costly 
efforts  only  to  find  that  a  reasonably  efficient  query  cannot  be  improved  further. 

As we discussed in Chapter 3, the ability to carry out semantic equivalence transformations rests on
the semantic knowledge about the database. As we shall see in this chapter, the ability to control the
semantic query optimization system depends upon knowledge about what transformations are likely
to yield a lower cost query. This ability rests in turn upon two kinds of knowledge: knowledge of the
physical file organization of the database, and knowledge of the available retrieval processes,
particularly in terms of how various aspects of those processes influence their cost.

In Chapter 2 we indicated that one of the main results of conventional query optimization is the
identification of standard file structures and an appreciation of the factors that contribute to the cost
of query processing. We can see how this interacts with judging the potential usefulness of a
semantic transformation. Consider the query “What ships are carrying iron ore?” posed to a database
that lists information about ships and their current cargoes. Three kinds of information can be
brought to bear to decide the usefulness of semantic transformations in this case.

•  Knowledge of processing cost factors. Two ways to extract qualifying tuples from a
relation are to perform a sequential segment scan and to perform a scan by way of a
clustered index. The latter method is usually much less expensive. This means that the
presence of a restriction on a clustered index attribute significantly lowers the cost of this
kind of process.

•  Knowledge of file structures. In this case, let us assume that there is a SHIPS relation
stored as one file, and that the file has a clustered index on the Shiptype attribute.

•  Knowledge about the semantics of the database. Let us assume that there is a semantic
integrity constraint to the effect that the only type of ship capable of carrying iron ore is a
bulk ore carrier. That is, no tuple can exist in the SHIPS relation for which the Cargo
field has the value “iron ore” and the Shiptype field has some value other than “bulk ore
carrier”.

From the knowledge of processing cost factors, an effective semantic query optimization system
should rate as potentially useful any transformation that starts with a query that must be processed by
a segment scan and that results in a query that can be processed by an indexed scan. From the
knowledge of file structures, the system should determine for this query that there is a potential
opportunity to make this kind of transformation. What is needed is a semantic constraint that relates
the values of the Shiptype and Cargo attributes. The knowledge of the semantics of the database
gives such a constraint in this case. The query can be transformed into “What bulk ore carriers are
carrying iron ore?” The cost to process this query should be compared to the cost of the original
query to select the one to be posed to the database.

This example suggests how the flow of information and control can be organized in an effective
semantic query optimization system. The system analyzes the query with respect to processing
methods and file structures. The analysis identifies potentially useful transformations specialized to
the context of the current query. That is, they are expressed in terms of relations or attributes that are
involved in the query. If potentially useful transformations are identified, the system retrieves
appropriate semantic constraints using the specialized descriptions. The system then carries out
semantic equivalence transformations and simplifications with those constraints. Finally, the system
evaluates the efficiency of the resulting queries and selects for processing the one with lowest
estimated cost.
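
The iron ore example can be traced through this flow with a small sketch. It is an illustration under simplifying assumptions, not the QUIST implementation: a query is reduced to a dictionary of equality constraints, and the names `Rule`, `CLUSTERED_INDEX`, `RULES`, `uses_index`, and `transform` are hypothetical. The final cost comparison is left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    """Integrity constraint: if_attr = if_value implies then_attr = then_value."""
    if_attr: str
    if_value: str
    then_attr: str
    then_value: str

# Knowledge of file structures (assumed): SHIPS is clustered on Shiptype.
CLUSTERED_INDEX = {"Shiptype"}

# Knowledge of database semantics: only bulk ore carriers carry iron ore.
RULES = [Rule("Cargo", "iron ore", "Shiptype", "bulk ore carrier")]

def uses_index(query: dict) -> bool:
    """True if the query already restricts a clustered index attribute."""
    return any(attr in CLUSTERED_INDEX for attr in query)

def transform(query: dict) -> dict:
    """Add implied constraints on indexed attributes when that could help."""
    result = dict(query)
    if uses_index(query):
        # Already processable by an indexed scan; nothing to seek.
        return result
    for rule in RULES:
        applicable = query.get(rule.if_attr) == rule.if_value
        useful = rule.then_attr in CLUSTERED_INDEX
        if applicable and useful and rule.then_attr not in result:
            result[rule.then_attr] = rule.then_value
    return result

original = {"Cargo": "iron ore"}   # "What ships are carrying iron ore?"
improved = transform(original)     # adds Shiptype = "bulk ore carrier"
```

A query about a cargo with no matching rule, such as `{"Cargo": "wheat"}`, comes back unchanged, which mirrors the point that the system should not spend effort where no useful transformation exists.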

4.2  Introduction  to  the  QUIST  system 

The QUIST system (QUery Improvement through Semantic Transformation) is a program that has
been implemented to explore the design and operation of an effective semantic query optimization
system in the context of an important class of relational database queries. The system demonstrates
the ability to transform queries by reasoning about the semantics of the database. It shows that it is
possible for a semantic query optimization system to achieve significant improvements in query
processing efficiency that are unattainable by conventional methods. It also shows that a semantic
query optimization system can run with acceptable overhead compared to the overall cost of
processing queries. In doing so, QUIST demonstrates the use of specific inference guiding heuristics
based on structure and processing expertise originating in conventional query optimization research.

In this section, we describe the class of queries for which QUIST can attempt semantic query
optimization. We also indicate the kinds of semantic integrity rules that QUIST can use for this
purpose. The choice of the kinds of semantic knowledge used by QUIST follows the ideas set forth
in Section 4.1. We take up the other issue of Section 4.1, the control of semantic transformations by
means of structural and processing knowledge, later in this chapter.


4.2.1  The class of queries handled by QUIST

The  QUIST  query  language  is  a  query  language  for  relational  databases.  It  is  in  a  class  of 
languages  that  can  be  termed  attribute/constraint  languages.  This  choice  reflects  the  importance  of 
constraints  on  attributes  as  described  in  Section  4.1.  Indeed,  the  entire  QUIST  system  is  designed 
from  the  point  of  view  that  the  most  useful  semantic  transformations  in  relational  queries  can  be  seen 
as  the  addition,  deletion,  or  modification  of  constraints  on  database  attributes. 

Attribute/constraint languages are particularly simple, hence somewhat limited, yet have been
shown to admit a significant subset of restrict-join-project relational queries (see Section 2.3). Two
examples of attribute/constraint query languages are the IDA language developed by Sagalowicz at
SRI International [Sagalowicz77] and the APPLE language developed by Carlson and Kaplan at
Northwestern University [Carlson76]. The QUIST query language is modelled most closely on IDA.
In the context of the LADDER natural language database access system [Hendrix78], IDA has been
shown to admit a substantial and interesting class of queries.

The  essential  distinguishing  feature  of  an  attribute/constraint  language  is  that  it  presents  a 
relational  database  as  if  it  contained  just  a  single  virtual  relation,  masking  the  real  relations 
underlying  it.  The  point  of  this  is  to  make  the  specification  of  relational  database  queries  as  simple  as 
possible.  It  buffers  users  and  natural-language  understanding  programs  from  the  need  to  know  the 
structure  of  the  database  and  from  any  reorganization  of  the  database  that  involves  changes  in  the 
association  of  attributes  with  relations. 

The  single  virtual  relation  is  formed  from  the  real  relations  as  follows.  A  subset  of  all  the  possible 
joins  between  relations  is  specified  such  that  at  most  one  join  is  permitted  between  any  two  real 
relations,  and  so  that  there  exists  one  and  only  one  logical  path  (sequence  of  joins)  between  any  two 
real relations. The set of joins is performed and duplicates are eliminated. The result is the virtual
relation.  If  the  virtual  relation  is  thought  of  as  a  graph  whose  nodes  are  the  real  relations  and  whose 
edges  are  joins  between  real  relations,  then  the  virtual  relation  is  a  tree  structure  of  real  relations. 
Any  query  to  the  database  involves  a  subtree  of  this  virtual  relation. 
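
The requirement that exactly one logical path connect any two real relations amounts to requiring that the join graph be a tree. The check below is a minimal sketch under that reading; the function `is_tree` and the relations CARGO and OWNER are hypothetical additions for illustration and do not come from QUIST or IDA.

```python
from collections import deque

def is_tree(relations, joins):
    """Return True if the joins link all relations by exactly one path each."""
    relations = list(relations)
    # A tree on n nodes has exactly n - 1 edges.
    if not relations or len(joins) != len(relations) - 1:
        return False
    neighbors = {r: set() for r in relations}
    for left, right in joins:
        neighbors[left].add(right)
        neighbors[right].add(left)
    # Breadth-first search: the graph must also be connected.
    seen = {relations[0]}
    queue = deque([relations[0]])
    while queue:
        for nxt in neighbors[queue.popleft()] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return len(seen) == len(relations)

# Hypothetical schema: one permitted join per pair of relations.
RELATIONS = ["SHIP", "SHIPCLASS", "CARGO", "OWNER"]
JOINS = [("SHIP", "SHIPCLASS"), ("SHIP", "CARGO"), ("SHIP", "OWNER")]

valid = is_tree(RELATIONS, JOINS)
# A second path between CARGO and OWNER breaks the uniqueness requirement.
with_cycle = is_tree(RELATIONS, JOINS + [("CARGO", "OWNER")])
```

With the schema shown, `valid` is True and `with_cycle` is False, since the extra join would give two distinct join sequences between CARGO and OWNER.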

The virtual relation makes possible a great simplification in the specification of restrict-join-project
queries: joins are made implicit because they have already been specified in the definition of the
virtual relation. That means that an attribute/constraint query is specified solely in terms of
restrictions and projections. In other words, an attribute/constraint query consists of boolean
combinations of simple constraints on attributes, plus a list of attributes whose values are desired.
Significantly, tuple variables need no longer be used in the query, because there is only one (virtual)
relation.

The cost of this simplification is a set of limitations on the general relational model. For one thing,
attribute names must be unique throughout all relations because tuple variables are no longer
available to distinguish them. For another, no join is permitted other than those prespecified through
the definition of the virtual relation. The latter limitation implies that a relation can only be involved
once in a query (for instance, it cannot be joined to itself). With respect to the concepts represented
in the database, this means that it is possible to represent only one kind of relationship between any
two classes of entities represented as relations. Moore discusses some of these limitations in
[Moore79]. Nevertheless, as indicated above, attribute/constraint languages permit the expression of
an important and useful range of queries.

The  level  of  abstraction  presented  by  an  attribute/constraint  query  language  is  illustrated  by  an 
example  from  IDA.  Suppose  a  database  contains  two  relations: 

SHIP:  (Shipname  Shipclass  Shiptype) 

SHIPCLASS:  (Class  Type  Length  Draft) 

where,  for  instance,  Shipname  is  the  name  of  a  ship  and  Length  is  the  length  of  any  ship  in  a 
particular ship class. Assume the choice is to permit SHIP and SHIPCLASS to be joined on
Shipclass  and  Class,  respectively. 

A  request  for  the  names  of  all  ships  is  merely  a  request  to  print  all  values  of  the  Shipname 
attribute.  No  constraints  need  to  be  specified.  In  IDA,  this  request  is  specified  as  (?  Shipname).  In 
general, an expression of the form (? Attribute) returns the value of the specified attribute. To
request  the  length  of  a  ship  whose  name  is  “Totor”,  it  is  necessary  to  place  a  constraint  on  the 
Shipname  attribute  and  to  request  the  value  of  the  Length  attribute.  The  IDA  specification  is: 

(Shipname = “Totor”) (? Length).

The two attributes are on separate underlying relations, but IDA hides this from the user. The
IDA query processing system determines the logical access path between the two relations. It looks
up the prespecified join between SHIP and SHIPCLASS and, in effect, transforms the joinless form
of the query into one that includes the join term (SHIP.Shipclass = SHIPCLASS.Class).
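
The expansion of a joinless query into one with explicit join terms can be sketched as follows. This is an assumption-laden illustration rather than IDA's actual algorithm: the tables `ATTRIBUTE_OWNER` and `JOINS` and the function `join_terms` are hypothetical names, and the method shown (pruning joins that lead only to relations the query does not mention) is one simple way to find the connecting subtree.

```python
# Each attribute belongs to exactly one real relation (names are unique).
ATTRIBUTE_OWNER = {
    "Shipname": "SHIP", "Shipclass": "SHIP", "Shiptype": "SHIP",
    "Class": "SHIPCLASS", "Type": "SHIPCLASS",
    "Length": "SHIPCLASS", "Draft": "SHIPCLASS",
}

# Prespecified joins: (relation, attribute, relation, attribute).
JOINS = [("SHIP", "Shipclass", "SHIPCLASS", "Class")]

def join_terms(constrained, output, joins=JOINS):
    """Return the explicit join terms implied by a joinless query."""
    needed = {ATTRIBUTE_OWNER[a] for a in [*constrained, *output]}
    remaining = list(joins)
    pruned = True
    while pruned:
        pruned = False
        # Count how many remaining joins touch each relation.
        degree = {}
        for r1, _, r2, _ in remaining:
            degree[r1] = degree.get(r1, 0) + 1
            degree[r2] = degree.get(r2, 0) + 1
        for join in remaining:
            r1, _, r2, _ = join
            # Drop a join that only reaches a leaf relation the query ignores.
            if any(degree[r] == 1 and r not in needed for r in (r1, r2)):
                remaining.remove(join)
                pruned = True
                break
    return [f"({r1}.{a1} = {r2}.{a2})" for r1, a1, r2, a2 in remaining]

# (Shipname = "Totor") (? Length)
terms = join_terms({"Shipname": "Totor"}, ["Length"])
# (? Shipname) touches only SHIP, so no join is needed.
no_terms = join_terms({}, ["Shipname"])
```

For the Totor query, `terms` contains the single join term `(SHIP.Shipclass = SHIPCLASS.Class)`, while the request for all ship names yields an empty list.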

The class of queries handled by QUIST is actually somewhat different from IDA's. The
qualification of a QUIST query is a conjunction of constraints on attributes, rather than a general
boolean combination of constraints. This limitation is compensated for by permitting the constraint
on each attribute to be, in effect, a disjunction of simple constraints. QUIST does not attempt to
perform semantic transformations on such questions as “What ships are registered in France or are
over 200 feet long?” where the disjunction involves constraints on more than one attribute. In this
case, the design decision was to avoid the added difficulties of inference with general disjunctions on
the grounds that many practical queries do not involve them and because of the low probability of
finding less expensive transformations of them.

As  with  IDA,  QUIST  queries  can  specify  constraints  on  numerical-valued  attributes  and 
constraints  on  string-valued  attributes.  A  numerical  constraint  is  specified  as  the  intervals  in  which 
the  attribute's  value  is  permitted  to  fall.  The  complete  constraint  can  be  a  disjunction  of  these 
intervals.  For  instance,  suppose  a  query  constrains  the  Age  attribute  in  a  personnel  database  to  be 
greater  than  20  and  less  than  or  equal  to  25,  or  to  be  greater  than  or  equal  to  65  and  less  than  70.  This 
constraint is specified as

(Age ∈ ((20 25] [65 70))).

This  constraint  is  considered  to  be  a  disjunction  of  two  intervals.  QUIST  checks  that  intervals  do 
not  conflict.  If  the  constraint  on  a  numerical  attribute  is  in  fact  a  simple  constraint,  such  as  specifying 
that  Age  is  less  than  65,  then  the  preceding  form  can  be  abbreviated  as 

(Age  <  65) 

rather than, for instance, specifying an interval one of whose bounds is +∞ or −∞.
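
One way to model such interval constraints is shown below. This is a sketch, not QUIST's internal representation: the class `Interval` and the functions `satisfies` and `has_conflict` are hypothetical, each interval is assumed to have a lower bound strictly below its upper bound, and "conflict" is interpreted here as overlap between two intervals of the same constraint, which is an assumption about what QUIST checks.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Interval:
    low: float
    high: float
    low_closed: bool    # True for "[", False for "("
    high_closed: bool   # True for "]", False for ")"

    def contains(self, value) -> bool:
        above = value >= self.low if self.low_closed else value > self.low
        below = value <= self.high if self.high_closed else value < self.high
        return above and below

    def overlaps(self, other) -> bool:
        """True if some value lies in both intervals (assumes low < high)."""
        first, second = sorted((self, other), key=lambda i: i.low)
        if second.low < first.high:
            return True
        # Touching endpoints overlap only if both ends are closed.
        return (second.low == first.high
                and second.low_closed and first.high_closed)

def satisfies(value, intervals) -> bool:
    """A numerical constraint is a disjunction of its intervals."""
    return any(i.contains(value) for i in intervals)

def has_conflict(intervals) -> bool:
    return any(a.overlaps(b) for a, b in combinations(intervals, 2))

# (Age ∈ ((20 25] [65 70)))
AGE = [Interval(20, 25, False, True), Interval(65, 70, True, False)]
```

With this encoding, ages 25 and 65 satisfy the constraint while 20 and 70 do not, and `has_conflict(AGE)` is False because the two intervals are disjoint.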

String-valued  attributes  can  be  constrained  to  be  a  member  of  some  set  of  strings,  or  to  be 
excluded from some set of strings. For example, if Shiptype must be either “tanker” or “fishing”, the
constraint is specified as:

(Shiptype ∈ {“tanker” “fishing”}).

Another type of constraint for string-valued attributes is typified by the constraint that Shiptype
must be neither “bulk” nor “refrigerated”:

(Shiptype ∉ {“bulk” “refrigerated”}).

This  is  equivalent  to  a  conjunction  of  simple  inequality  constraints.  As  with  numerical  constraints, 
the  notation  for  a  simple  constraint  can  be  abbreviated,  as  for  example: 

(Shiptype  =  “supertanker”) 

to  indicate  that  the  Shiptype  must  be  a  supertanker. 

The  complete  syntax  of  queries  admitted  by  the  QUIST  system  is  given  in  Appendix  A. 


4.2.2  QUIST’s  semantic  knowledge  base 

The  QUIST  system  captures  the  important  semantic  integrity  restrictions  on  attributes  and 
relationships  described  in  Section  4.1.  The  single-relation  view  of  the  database  makes  it  easy  to 
express  these  restrictions,  subject  to  the  limitation  that  only  one  kind  of  relationship  can  be 
represented  between  any  two  kinds  of  entities.  The  restrictions  are  stored  in  a  “conceptual  schema” 
or knowledge base where they are associated with the attributes they mention.

The  simplest  type  of  restriction  is  what  McLeod  [McLeod76]  refers  to  as  domain  definition.  This 
type  of  restriction  specifies  the  possible  values  of  an  attribute  regardless  of  the  values  of  any  other 
attributes,  and  regardless  of  any  relationships  involving  the  entity  to  which  the  attribute  is  associated. 
For  instance,  if  it  is  known  that  all  ships  in  the  database  have  a  deadweight  of  between  20  thousand 
tons  and  450  thousand  tons,  regardless  of  their  shipclass,  their  registry,  the  type  of  business  of  their 
owner,  or  any  other  factor,  then  the  knowledge  base  would  associate  with  the  Deadweight  attribute 
the  restriction: 


(Deadweight ∈ ([20 450])).

In terms of the more general first-order formulas described in Chapter 3, domain definition
restrictions are implicitly universally quantified over the (real) relation to which the restricted
attribute is associated. Thus, the example restriction corresponds to:

$\forall x/\text{SHIPS} \; (x.\text{Deadweight} \geq 20) \land (x.\text{Deadweight} \leq 450)$

The other types of restrictions involve two or more attributes. Two kinds of multiattribute
restrictions are represented. One kind, called a bounding rule, asserts that the value of one attribute is
bounded by the value of another attribute. For example, the quantity of a cargo that can be carried
by a ship is bounded by the capacity of the ship. If there are two relations, CARGOES and SHIPS,
and the unique logical access path defined between them corresponds to a “carrying” relationship,
then the bounding rule can be represented simply as:

(Quantity ≤ Capacity).

The  corresponding  form  of  this  restriction  in  terms  of  a  general  first  order  formula  is  a  universally 
quantified  expression  in  which  the  predefined  logical  access  path  between  SHIPS  and  CARGOES  is 
made  explicit. 

$\forall x/\text{SHIPS} \; \forall y/\text{CARGOES} \; (x.\text{Shipname} = y.\text{Ship}) \rightarrow (y.\text{Quantity} \leq x.\text{Capacity})$

The other type of multiattribute semantic restriction is called a production. A production is a rule
of the form:

$C_1(A_1) \land C_2(A_2) \land \dots \land C_k(A_k) \rightarrow C'(A)$.

Every term in the rule is a constraint expression on an attribute. No attribute can appear more
than once on the left hand side. An example of a production is a rule that states that cargoes of
refined petroleum products are carried only by ships whose deadweight is under 60 thousand tons.
This rule involves the same “carrying” relationship and hence the same implicit join between
CARGOES and SHIPS as in the previous example. In this case, the rule is represented as:

(Cargotype = “refined”) → (Deadweight ≤ 60)

where  Deadweight  is  in  units  of  thousands  of  tons.  The  production  form  used  by  QUIST  is  the  Horn 
clause  form  common  in  deductive  databases  [Nicolas78a]. 

As with bounding rules, the corresponding general first order formula is a universally quantified
expression with an explicit join term when attributes from more than one relation are involved:

$\forall x/\text{SHIPS} \; \forall y/\text{CARGOES} \; (x.\text{Shipname} = y.\text{Ship}) \land (y.\text{Cargotype} = \text{"refined"}) \rightarrow (x.\text{Deadweight} \leq 60)$

The fact that the semantics of domain definitions, bounding rules, and productions can be expressed
as  simply  as  in  the  foregoing  examples  is  one  of  the  motivations  behind  the  choice  of  the  QUIST  data 
model  and  language. 
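
To make the three rule types concrete, each can be read as a predicate that every tuple of the virtual relation must satisfy. The sketch below encodes the three example rules this way; it is an illustration only, the names `KNOWLEDGE_BASE` and `violated_rules` are hypothetical, and QUIST stores the rules symbolically for inference rather than as executable checks.

```python
def domain_definition(t) -> bool:
    """(Deadweight ∈ ([20 450])), in thousands of tons."""
    return 20 <= t["Deadweight"] <= 450

def bounding_rule(t) -> bool:
    """(Quantity ≤ Capacity) along the "carrying" join."""
    return t["Quantity"] <= t["Capacity"]

def production(t) -> bool:
    """(Cargotype = "refined") → (Deadweight ≤ 60)."""
    # An implication p -> q is evaluated as (not p) or q.
    return t["Cargotype"] != "refined" or t["Deadweight"] <= 60

KNOWLEDGE_BASE = {
    "domain definition": domain_definition,
    "bounding rule": bounding_rule,
    "production": production,
}

def violated_rules(t):
    """Names of the rules that tuple `t` of the virtual relation breaks."""
    return [name for name, rule in KNOWLEDGE_BASE.items() if not rule(t)]

# Hypothetical tuples of the virtual relation (SHIPS joined with CARGOES).
permitted = {"Deadweight": 50, "Quantity": 30, "Capacity": 40,
             "Cargotype": "refined"}
forbidden = {"Deadweight": 100, "Quantity": 30, "Capacity": 40,
             "Cargotype": "refined"}
```

Here `violated_rules(permitted)` is empty, while `violated_rules(forbidden)` reports the production, because a 100 thousand ton ship may not carry refined products under the example rule.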


4.3  Overview  of  the  operation  of  the  QUIST  system 

The  QUIST  system  accepts  a  query  in  the  QUIST  relational  database  query  language,  produces  a 
set  of  semantically  equivalent  queries  (possibly  including  only  the  original  query),  and  returns  the 
query  from  that  set  with  the  lowest  estimated  retrieval  cost.  In  this  section,  we  present  an  overview  of 
how  QUIST  performs  these  tasks.  We  defer  detailed  descriptions  until  Section  4.4. 

In Section 1.1, we indicated that QUIST operates in a mode of plan, generate, and test that
appears in other artificial intelligence programs for solving a wide range of problems
[Feigenbaum71]. The purpose of the planning step is to identify both desirable and undesirable
characteristics of a solution to the given problem. These characteristics of a solution are used to
control the generation step in which candidate solutions are produced. Finally, the testing step
carries out detailed evaluation of the candidate solutions in order to select the one with highest merit.
Overall, the three steps are characterized by the degree of abstraction at which the problem is
addressed, and by the kind of search carried out.


4.3.1  The  planning  step  --  identification  of  constraint  targets 

The  planning  step  starts  with  the  constraints  specified  in  the  input  query.  Using  heuristics  based 
on  structure  and  processing  knowledge,  the  system  determines  which  database  relations,  if  any,  are 
constraint  targets.  A  relation  that  is  a  constraint  target  is  one  that  has  attributes  on  which  additional 
constraints  should  be  sought.  Constraint  targets  are  determined  by  viewing  the  query  only  in  terms 
of  which  relations  it  involves,  either  through  constraints  or  through  selection  for  output.  The  search 
space  is  very  simple,  consisting  merely  of  assignments  of  relations  to  the  sets  of  targets  and 
nontargets.  Incidentally,  the  concept  of  constraint  targets  should  not  be  confused  with  the  term 
target list, commonly used to describe which attributes are to be output from the database. Instead of
target  list,  we  use  the  term  output  attributes. 

If there are no constraint targets, QUIST merely returns the original query unchanged. In such a
case, QUIST has determined that it is not worthwhile to generate equivalent queries because no
equivalent query is likely to cost less to process than the original query. On the other hand, if there
are constraint targets then QUIST continues on to the generation of semantically equivalent queries.


4.3.2  The generation step -- production of constraints and semantically equivalent
queries

The generation step consists of a cycle of constraint production operations repeated until no more
constraints are produced. Each cycle of constraint production retrieves relevant knowledge base
rules, filters them according to structurally-based criteria (that is, the list of constraint targets), tests
them for applicability against the current constraints, and asserts new constraints if possible. The
process terminates when some cycle fails to generate new constraints.

The generation step treats the query less abstractly than the planning step does. It must use the
precise constraints on database attributes, not merely the names of constrained or output attributes.
The search at the generation level is through a space of semantically equivalent queries. Each move
consists of the production of another constraint. Only plausible moves are permitted because the
constraint target list produced by the planning step permits only those transformations that may
possibly lower the cost of processing.
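
The cycle of constraint production can be pictured as a fixed-point loop. The following sketch makes several simplifying assumptions that are not in the text: constraints are plain equalities held in a dictionary, rules are productions written as (antecedent, consequent) pairs, and constraint targets are given as a set of attribute names rather than relations. The function `generate_constraints` and the Hull rule are hypothetical.

```python
def generate_constraints(query, rules, target_attributes):
    """Repeat constraint production until a full cycle adds nothing new."""
    constraints = dict(query)
    while True:
        produced = False
        for antecedent, (attr, value) in rules:
            # Structural filter: only produce constraints on target attributes,
            # and skip attributes that are already constrained.
            if attr not in target_attributes or attr in constraints:
                continue
            # Applicability test against the current constraints.
            if all(constraints.get(a) == v for a, v in antecedent.items()):
                constraints[attr] = value
                produced = True
        if not produced:
            return constraints

RULES = [
    # Hypothetical rule, listed first so that it only fires on a later cycle.
    ({"Shiptype": "bulk ore carrier"}, ("Hull", "reinforced")),
    ({"Cargo": "iron ore"}, ("Shiptype", "bulk ore carrier")),
]

query = {"Cargo": "iron ore"}
narrow = generate_constraints(query, RULES, {"Shiptype"})
wide = generate_constraints(query, RULES, {"Shiptype", "Hull"})
```

With Shiptype as the only target, `narrow` gains just the Shiptype constraint; adding Hull to the targets lets a second cycle chain from the newly produced constraint, so `wide` also contains Hull. The target set is what keeps the search from following every applicable rule.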


4.3.3  The  testing  step  --  selection  of  the  query  with  lowest  estimated  cost 

The  generation  step  produces  one  or  more  QUIST  queries  that  are  known  to  produce  the  same 
answer.  In  the  testing  step,  each  query  is  analyzed  by  conventional  query  optimization  methods. 
This  yields  an  estimated  lowest  cost  to  perform  each  query.  The  query  with  the  minimum  estimated 
lowest cost is determined.

At the testing level, the search is through a space of physical realizations of a single logically
expressed query. The query itself must be analyzed in the greatest detail, in terms of the actual
database files it accesses and the sequence in which it accesses them.†


4.3.4  Summary  of  QUIST  operations 


In  describing  the  detailed  operation  of  the  QUIST  system,  it  is  convenient  to  distinguish  the  three 
steps  just  described  plus  the  task  of  grouping  inferred  constraints  into  semantically  equivalent 
queries.  To  summarize  the  operations,  then,  the  following  steps  take  place: 


1.  Identification  of  constraint  targets  (the  planning  step) 

2.  Inference  of  new  constraints  (part  of  the  generation  step) 

3.  Grouping  of  constraints  into  the  set  of  semantically  equivalent  queries  (conclusion  of  the 
generation  step) 

4. Estimation of the minimum processing time for each query and selection of the query
with the lowest estimated processing time (testing step)

We  now  present  an  example  to  illustrate  these  steps. 


†The problem is nonetheless abstracted in the sense that the actual query is not carried out; rather, the cost to perform it is
just estimated.

4.4 Example of the operation of the QUIST system

In  this  section,  we  begin  an  example  of  the  operation  of  the  QUIST  system.  The  example  brings 
out  the  kinds  of  knowledge  that  semantic  query  optimization  requires,  and  shows  precisely  how 
QUIST  integrates  the  different  knowledge  sources  into  an  effective  system.  In  particular,  the 
example  is  used  to  identify  a  set  of  heuristics  that  guide  the  inference  of  new  constraints.  These 
heuristics  are  specific  to  relational  databases  as  described  in  Chapter  2. 

The  example  is  specifically  tailored  to  illustrate  the  special  capabilities  of  semantic  query 
optimization.  In  particular,  it  involves  both  the  addition  and  the  elimination  of  relations  from  a 
query.  The  example  also  illustrates  how  structural  knowledge  is  used  to  halt  a  particular  line  of 
constraint  generation  when  the  constraints  appear  to  offer  no  hope  of  reducing  the  cost  of  query 
processing. 

QUIST’s operation is illustrated using the relational database shown in Figure 4-1:

SHIPS  (Shipname  Owner  Shiptype  Draft  Deadweight  Capacity  Registry) 

PORTS (Portname Country Depth Facilitytype)

CARGOES (Ship Destination Shipper Cargotype Quantity Dollarvalue Insurance)

OWNERS (Ownername Location Assets Business)

POLICIES  (Policy  Issuer  Coverage) 

INSURERS  (Insurer  Insurercountry  Capitalization) 

Figure  4-1:  Example  database  relations 

QUIST operates with an attribute/constraint data model. Specifically, it treats the database as a
single virtual relation. It is therefore necessary to specify the unique logical access paths among the
real relations. The joins that underlie the permitted logical access paths are:

1. OWNERS.Ownername = SHIPS.Owner

2. SHIPS.Shipname = CARGOES.Ship

3. CARGOES.Destination = PORTS.Portname

4. CARGOES.Insurance = POLICIES.Policy

5.  POLICIES.Issuer  =  INSURERS.Insurer 

which stand for, respectively,

1. A ship and its owner

2.  A  ship  and  a  cargo  it  is  carrying 

3.  A  cargo  and  its  destination  port 

4.  A  cargo  and  the  policy  that  insures  it 

5.  A  policy  and  its  issuing  company 
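
To make the single-virtual-relation view concrete, the sketch below records the relations of Figure 4-1 and the five joins as plain Python data, and finds the real relations that a set of attribute names brings into a query. It is an illustration only: the names RELATIONS, JOINS, path, and relations_for are hypothetical, and the sketch relies on two properties visible in the example, namely that attribute names are unique across relations and that the joins form a tree, so that the access path between any two relations is unique.

```python
from collections import deque

RELATIONS = {
    "SHIPS": ["Shipname", "Owner", "Shiptype", "Draft", "Deadweight", "Capacity", "Registry"],
    "PORTS": ["Portname", "Country", "Depth", "Facilitytype"],
    "CARGOES": ["Ship", "Destination", "Shipper", "Cargotype", "Quantity", "Dollarvalue", "Insurance"],
    "OWNERS": ["Ownername", "Location", "Assets", "Business"],
    "POLICIES": ["Policy", "Issuer", "Coverage"],
    "INSURERS": ["Insurer", "Insurercountry", "Capitalization"],
}

# (left relation, left attribute, right relation, right attribute)
JOINS = [
    ("OWNERS", "Ownername", "SHIPS", "Owner"),
    ("SHIPS", "Shipname", "CARGOES", "Ship"),
    ("CARGOES", "Destination", "PORTS", "Portname"),
    ("CARGOES", "Insurance", "POLICIES", "Policy"),
    ("POLICIES", "Issuer", "INSURERS", "Insurer"),
]

ATTRIBUTE_HOME = {attr: rel for rel, attrs in RELATIONS.items() for attr in attrs}
NEIGHBORS = {rel: set() for rel in RELATIONS}
for left, _, right, _ in JOINS:
    NEIGHBORS[left].add(right)
    NEIGHBORS[right].add(left)


def path(start, goal):
    """Breadth-first search for the (unique) join path between two relations."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for nxt in NEIGHBORS[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    result = []
    node = goal
    while node is not None:
        result.append(node)
        node = parent[node]
    return result[::-1]


def relations_for(attributes):
    """Real relations needed to evaluate a query that names these attributes."""
    homes = sorted({ATTRIBUTE_HOME[a] for a in attributes})
    needed = set(homes)
    for other in homes[1:]:
        needed.update(path(homes[0], other))
    return needed
```

For instance, relations_for(["Business", "Facilitytype"]) returns OWNERS, SHIPS, CARGOES, and PORTS, because the only access path from OWNERS to PORTS passes through SHIPS and CARGOES.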

The  database  is  assumed  to  be  implemented  by  means  of  one  file  per  relation.  It  is  further 
assumed  that  the  SHIPS  file  has  a  clustering  index  on  its  OWNER  attribute.  This  means  that  the 
SHIPS  file  is  clustered  with  respect  to  the  OWNERS  file;  given  a  tuple  in  the  OWNERS  file,  the 
corresponding  tuples  in  the  SHIPS  file  (that  is,  the  ships  owned  by  that  owner)  can  be  accessed  much 
less  expensively  than  by  a  sequential  search  of  SHIPS.  In  addition,  the  OWNERS  file  is  much 
smaller  than  the  SHIPS  file. 

A  very  simple  knowledge  base  of  general  semantic  rules  accompanies  the  database  in  this  example. 
We don't wish to claim the validity of all these rules; they are merely useful illustrations of the kinds
of rules that can be used for semantic query optimization. The rules are:

• Rule R1. Every ship over 350 thousand tons deadweight can operate only at ports with
offshore load/discharge capabilities.

• Rule R2. Only leasing companies own vessels that exceed 300 thousand tons deadweight.

•  Rule  R3.  A  cargo  is  never  insured  for  more  than  its  dollar  value. 

•  Rule  R4.  A  ship  carries  no  more  cargo  than  its  rated  capacity. 

•  Rule  R5.  Any  cargo  other  than  liquefied  natural  gas  or  refined  petroleum  products  that  is 
worth  more  than  500,000  dollars  is  handled  only  at  general  cargo  ports. 

•  Rule  R6.  The  only  ships  whose  deadweight  exceeds  150  thousand  tons  are  supertankers 
or  aircraft  carriers. 

•  Rule  R7.  Cargoes  worth  over  three  million  dollars  and  carried  by  supertankers  are 
insured  by  policies  issued  by  Lloyds. 

•  Rule  R8.  Ships  owned  by  petroleum  companies  only  carry  liquefied  natural  gas,  refined 
petroleum  products,  or  crude  oil. 

The rules in the example knowledge base are represented to QUIST as:

R1: (Deadweight > 350) → (Facilitytype = "offshore")

R2: (Deadweight > 300) → (Business = "leasing")

R3: (Coverage ≤ Dollarvalue)

R4: (Quantity ≤ Capacity)

R5: (Cargotype ∉ {"LNG" "refined"}) ∧ (Dollarvalue > 500) → (Facilitytype = "general")

R6: (Deadweight > 150) → (Shiptype ∈ {"supertanker" "carrier"})

R7: (Dollarvalue > 3000) ∧ (Shiptype = "supertanker") → (Issuer = "Lloyds")

R8: (Business = "petroleum") → (Cargotype ∈ {"LNG" "refined" "oil"})
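
Because the rules serve both as integrity constraints and as a basis for inference, it may help to see them stated executably. The sketch below encodes each rule as an (antecedent, consequent) pair of predicates over one tuple of the virtual relation and reports which rules a tuple violates. The encoding, the name violated_rules, and the sample tuple are assumptions made for illustration. Deadweight is taken to be in thousands of tons and Dollarvalue and Coverage in thousands of dollars, as the numeric thresholds in the rules suggest.

```python
# Each rule is (antecedent, consequent); bounding rules R3 and R4 have an
# antecedent that is always true.
RULES = {
    "R1": (lambda t: t["Deadweight"] > 350,
           lambda t: t["Facilitytype"] == "offshore"),
    "R2": (lambda t: t["Deadweight"] > 300,
           lambda t: t["Business"] == "leasing"),
    "R3": (lambda t: True,
           lambda t: t["Coverage"] <= t["Dollarvalue"]),
    "R4": (lambda t: True,
           lambda t: t["Quantity"] <= t["Capacity"]),
    "R5": (lambda t: t["Cargotype"] not in {"LNG", "refined"} and t["Dollarvalue"] > 500,
           lambda t: t["Facilitytype"] == "general"),
    "R6": (lambda t: t["Deadweight"] > 150,
           lambda t: t["Shiptype"] in {"supertanker", "carrier"}),
    "R7": (lambda t: t["Dollarvalue"] > 3000 and t["Shiptype"] == "supertanker",
           lambda t: t["Issuer"] == "Lloyds"),
    "R8": (lambda t: t["Business"] == "petroleum",
           lambda t: t["Cargotype"] in {"LNG", "refined", "oil"}),
}


def violated_rules(row):
    """Names of the rules whose antecedent holds but whose consequent fails."""
    return sorted(
        name for name, (antecedent, consequent) in RULES.items()
        if antecedent(row) and not consequent(row)
    )


# A hypothetical tuple of the virtual relation (one ship, cargo, port, policy).
sample = {
    "Deadweight": 420, "Shiptype": "supertanker", "Business": "leasing",
    "Facilitytype": "offshore", "Cargotype": "LNG", "Dollarvalue": 800,
    "Coverage": 700, "Quantity": 300, "Capacity": 400, "Issuer": "Lloyds",
}

consistent = violated_rules(sample)                                # []
wrong_port = violated_rules({**sample, "Facilitytype": "general"})  # ['R1']
over_insured = violated_rules({**sample, "Coverage": 900})          # ['R3']
```

A database state is permitted only if no tuple violates any rule, and it is this guarantee that licenses the query transformations discussed in this chapter.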

The  subject  of  our  example  is  the  following  query: 

“List the destination of cargoes worth less than one million dollars being carried by
supertankers over 400 thousand tons deadweight to ports with offshore load/discharge
facilities.”

Note  that  this  query  was  invented  specifically  to  illustrate  semantic  query  optimization  capabilities. 
As  motivation,  consider  a  shipping  analyst  who  wishes  to  detect  cases  in  which  very  large  ships  are 
being  employed  wastefully  so  that  they  can  be  rerouted  to  more  profitable  activities. 

The  representation  of  this  query  to  QUIST  is: 

Q: (Deadweight > 400) ∧ (Shiptype = "supertanker")

∧ (Dollarvalue < 1000) ∧ (Facilitytype = "offshore");

(? Destination)

Processing  query  Q  as  given  involves  three  relations,  SHIPS,  CARGOES,  and  PORTS,  and  two 
joins among them: SHIPS to CARGOES, and CARGOES to PORTS. However, we readily see that
semantic rule R1 makes the constraint on Facilitytype superfluous. If this constraint is eliminated,
then it won’t be necessary to involve PORTS in the processing at all. PORTS is involved in Q only to
restrict  tuples  in  CARGOES,  and  it  turns  out  that  this  restriction  is  unnecessary. 

Moreover,  the  constraint  on  Deadweight  also  makes  it  possible  to  infer  a  constraint  on  the  Business 
attribute  of  the  OWNERS  file.  Although  this  introduces  a  join  to  a  new  file,  the  database  is 
structured so that this may be advantageous. This is because this join has, in effect, been
precomputed  and  stored  as  the  link  from  OWNERS  to  SHIPS. 

In addition, a constraint can be inferred on the Coverage attribute of the POLICIES relation by
means of rule R3. However, it is not desirable to involve POLICIES in the query because the join to
CARGOES is not supported by a prestored link or index.
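
The three observations above can be checked mechanically. The sketch below is a hand-coded illustration for this query only, not QUIST's general inference procedure: the function infer and the dictionary encoding of the query are assumptions. It applies rule R1, rule R2, and the bounding rule R3 (Coverage ≤ Dollarvalue) to the constraints of Q, then separates query constraints that are implied by other constraints from constraints that are new.

```python
# Query Q as a mapping from attribute to (operator, value).
QUERY = {
    "Deadweight": (">", 400),
    "Shiptype": ("=", "supertanker"),
    "Dollarvalue": ("<", 1000),
    "Facilitytype": ("=", "offshore"),
}


def infer(query):
    """Constraints implied by the query under rules R1, R2, and R3."""
    inferred = {}
    op, value = query.get("Deadweight", (None, None))
    # R1: Deadweight > 350 is guaranteed when the query bound is at least 350.
    if op == ">" and value >= 350:
        inferred["Facilitytype"] = ("=", "offshore")
    # R2: likewise for the threshold of 300.
    if op == ">" and value >= 300:
        inferred["Business"] = ("=", "leasing")
    op, value = query.get("Dollarvalue", (None, None))
    # R3: Coverage <= Dollarvalue < value, hence Coverage < value.
    if op == "<":
        inferred["Coverage"] = ("<", value)
    return inferred


inferred = infer(QUERY)
# Query constraints that follow from the rest of the query.
superfluous = sorted(a for a in QUERY if inferred.get(a) == QUERY[a])
# Inferred constraints on attributes the query did not mention.
new_constraints = {a: c for a, c in inferred.items() if a not in QUERY}
```

Here superfluous contains only Facilitytype, which is what allows PORTS to be dropped, while new_constraints holds the Business and Coverage constraints. Whether either new constraint is worth adding depends on the physical structure of the database, which is the subject of the next step.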

We now discuss how QUIST handles this query.

4.4.1 Step 1 - Identification of constraint targets

QUIST’s  first  step  establishes  inference  goals.  The  task  of  this  goal-setting  step  is  to  accept  the  list 
of attributes that are constrained or designated for output by the query, and to return a (possibly
empty)  list  of  target  relations  on  which  the  placement  of  additional  constraints  may  be  worthwhile. 

In this step, QUIST determines whether it seems worthwhile to seek to transform the given query
into an equivalent one. If it does seem worthwhile, then QUIST seeks to identify what opportunities
exist for cost-reducing transformations. However, it may be the case that it is not worthwhile to
transform the given query. For example, the query restrictions might consist of just a single
constraint which happens to be on an attribute with a clustering index. No additional constraints can
reduce the processing effort for that query. Any effort devoted to inference would then be wasted.
Even if inference is not ruled out, there will probably be only a few relations on which the placement
of additional constraints will reduce query processing effort. Pruning the set of target relations can
significantly reduce useless inference effort.

To  produce  a  set  of  constraint  targets  from  a  set  of  constrained  or  output  attributes,  QUIST  uses 
constraint  generation  heuristics.  These  heuristics  are  based  upon  knowledge  about  the  structure  of 
the  database  and  about  the  factors  that  contribute  to  the  cost  of  retrieval.  The  heuristics  reflect  the 
expert knowledge developed from analysis of relational database query processing.

By what criterion should target relations be chosen? The general answer is that a relation should
be the target for the generation of constraints if and only if the placement of such constraints on the
relation  makes  some  retrieval  operation  less  expensive  or  renders  it  unnecessary  altogether. 

We can make this criterion more specific in the context of the retrieval operations for restrict-join-project
queries discussed in Section 2.3. The major operations are scanning a relation and joining
two relations.

4.4.1.1 Scanning a relation

We first consider scanning a relation. A relation can be scanned in three ways: by a segment scan,
by a scan using a nonclustered index, or by a scan using a clustered index. A segment scan looks at
every page in the segment that contains the relation. A scan with a nonclustered index looks (more or
less) at one page for every qualifying tuple.

As for a clustered index scan, we introduce the concept of restriction selectivity [Yao79]. Selectivity
is a fraction between 0 and 1. It corresponds to the fraction of tuples in a relation that meets some
constraint. The stronger the constraint, the closer selectivity is to 0. Let attribute A have a clustered
index. If constraint C is imposed on A, and C has a selectivity value of RSEL, then a clustered index
scan via attribute A using constraint C retrieves approximately a fraction RSEL of the pages on which
the relation is stored.
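
A rough page-count comparison of the three scan methods follows directly from these descriptions. The sketch below is a simplification under stated assumptions: it counts data pages only and ignores the pages of the index itself, the function names are hypothetical, and the sizes (a 1000-page relation holding 20,000 tuples, with RSEL = 0.125) are invented for illustration.

```python
import math


def segment_scan_pages(segment_pages: int) -> int:
    """A segment scan touches every page of the segment."""
    return segment_pages


def nonclustered_index_scan_pages(tuples: int, rsel: float) -> int:
    """Roughly one page fetch per qualifying tuple."""
    return math.ceil(rsel * tuples)


def clustered_index_scan_pages(relation_pages: int, rsel: float) -> int:
    """Roughly the fraction RSEL of the relation's pages."""
    return math.ceil(rsel * relation_pages)


# Hypothetical sizes; the relation is assumed to fill its segment.
PAGES, TUPLES, RSEL = 1000, 20000, 0.125

segment = segment_scan_pages(PAGES)                          # 1000
nonclustered = nonclustered_index_scan_pages(TUPLES, RSEL)   # 2500
clustered = clustered_index_scan_pages(PAGES, RSEL)          # 125
```

With these numbers the nonclustered index scan fetches more pages than the segment scan, while the clustered index scan fetches an eighth as many.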

Consideration of these alternatives leads to the generalizations noted in Section 2.5:

• G1. A restriction on an attribute that is not indexed leads to an expensive scan.

•  G2.  A  restriction  (other  than  an  equality  predicate)  on  an  indexed  attribute  where  the 
index  is  not  a  physically  clustering  index  leads  to  an  expensive  scan. 

•  G3.  A  restriction  on  a  physically  clustering  index  can  be  processed  efficiently. 

These generalizations give us the following inference-guiding heuristic:

H1. Try to exploit a clustered index. Try to obtain a constraint on an attribute of a relation
which is restricted in the query and which has a clustered indexed attribute that is not
restricted in the query.

Another  heuristic  arises  from  the  same  generalizations.  It  involves  a  clustering  link  between  two 
relations  that  effectively  precomputes  and  stores  the  join  between  them.  There  is  a  clustering  link 
from  relation  X  to  relation  Y  if  each  tuple  in  relation  X  has  a  pointer  to  the  corresponding  tuples  of 
relation Y and those corresponding Y tuples are physically grouped together. The actual join can be
performed with X as the outer relation and Y as the inner relation. That is, X is scanned and for each
qualifying  tuple,  the  pointer  gives  the  corresponding  tuples  of  Y  quite  inexpensively.  The  same 
effect  is  achieved  if  Y  has  a  clustering  index  on  the  attribute  by  which  it  is  joined  to  X. 

From the perspective of scanning relation Y, however, the prestored join with X opens another
opportunity  to  reduce  retrieval  cost.  If  X  is  much  smaller  than  Y  and  if  an  effective  constraint  on  X 
can  be  found,  then  the  clustering  link  can  be  followed  to  extract  qualifiers  from  Y  inexpensively. 
One  way  to  look  at  this  is  to  regard  X  as  the  parent  of  Y  in  a  hierarchy.  Constraining  the  parent 
relation  is  very  effective  for  constraining  the  child  relation. 

The  surprising  aspect  is  that  it  can  be  advantageous  to  scan  Y  via  a  join  from  X  even  if  X  does  not 
appear  in  the  original  query.  That  is,  the  cost  of  the  overall  query  can  be  reduced  by  introducing  an 
additional  file  and  an  additional  join.  This  is  one  case  that  is  clearly  contrary  to  the  intuition 
expressed  in  conventional  query  optimization  research.  The  exploitation  of  a  clustering  link  is 
expressed  in  the  following  heuristic: 

H2.  Push  a  constraint  up  a  hierarchy.  A  relation  should  be  a  constraint  target  if  it  has  a 
clustering  link  into  a  much  larger  file  that  is  constrained  in  the  query,  even  if  the  relation 
itself  is  not  in  the  original  query. 

For the most part, however, it is not a good idea to introduce an additional relation and extra join
operations into a query for the obvious reason that joins are normally expensive. This advice is
summed up in the heuristic:

H3. Don’t introduce unlinked joins. With the exception of the clustering link
(parent/child) case, do not generate constraints for relations that are not part of the original
query.

4.4.1.2 Joining two relations

We now consider the join operation. Regardless of the method chosen to perform the join, we
have noted in Section 2.5 that

• G4. The cost of joins generally dominates the overall cost of processing.

•  G5.  A  join  between  two  large  and  weakly  restricted  relations  is  very  expensive. 

Thus,  much  of  our  concern  in  finding  new  constraints  centers  on  reducing  the  cost  of  joins.  We  have 
considered  performing  joins  by  two  methods:  the  nested  loops  method  and  the  merging  scans 
method.  For  simplicity  in  QUIST,  we  assume  that  all  joins  are  carried  out  by  the  nested  loops 
method,  but  much  of  the  justification  for  the  ensuing  inference  heuristics  holds  for  either  method. 

In the nested loops method, one relation (called the outer relation) is scanned, and for each outer
tuple  that  meets  the  constraints  on  that  relation,  the  second  relation  (called  the  inner  relation)  is 
scanned.  The  inner  scan  seeks  qualifying  inner  relation  tuples  that  match  the  outer  tuple  on  the  join 
attributes.  We  noted  in  Chapter  2  that  “the  cost  of  the  nested  loops  method  is  the  cost  of  scanning 
the  first  relation  plus  the  product  of  the  number  of  qualifying  first  relation  tuples  with  the  cost  of 
scanning  the  second  relation.”  This  presents  three  opportunities  to  reduce  the  cost  of  the  join  by  the 
generation  of  constraints:  reduce  the  cost  of  the  outer  scan,  reduce  the  number  of  qualifying  outer 
tuples,  and  reduce  the  cost  of  the  inner  scan. 
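
The quoted cost expression, and the three ways of reducing it, can be written out directly. In the sketch below the function name is hypothetical and all numbers are invented for illustration; costs are assumed to be measured in page fetches.

```python
def nested_loops_cost(outer_scan_cost, qualifying_outer_tuples, inner_scan_cost):
    """Cost of the outer scan plus one inner scan per qualifying outer tuple."""
    return outer_scan_cost + qualifying_outer_tuples * inner_scan_cost


# Hypothetical baseline: outer scan of 500 pages, 200 qualifying outer
# tuples, and 20 pages per inner scan.
baseline = nested_loops_cost(500, 200, 20)            # 4500

# The three opportunities named in the text, each applied on its own.
cheaper_outer_scan = nested_loops_cost(100, 200, 20)  # 4100
fewer_qualifiers = nested_loops_cost(500, 50, 20)     # 1500
cheaper_inner_scan = nested_loops_cost(500, 200, 5)   # 1500
```

Because the number of qualifying outer tuples multiplies the inner scan cost, reductions in either factor usually matter more than a cheaper outer scan.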

We’ve already discussed how to reduce the cost of scanning a relation, so we take up the question
of how the generation of constraints can help to reduce the number of qualifying tuples in the outer
scan. Let's first consider the underlying intuition. Suppose two relations X and Y are to be joined
and that both are restricted on some of their attributes. From the point of view of X, the join to the
restricted relation Y can simply be seen as a somewhat more indirect restriction than the simple
constraints on X’s attributes. That is, for some tuples in X that otherwise meet the restrictions on X’s
attributes, there are no corresponding tuples in Y, hence those tuples of X do not participate in the
join.

We would like to translate this indirect restriction into a simpler one in terms of constraints on
attributes  of  X  so  that  it  can  be  applied  prior  to  the  cross  referencing  scan  that  makes  the  join 
expensive  to  perform.  A  constraint  on  an  attribute  can  be  applied  much  less  expensively  than  a 
constraint  imposed  indirectly  through  a  join. 

Let’s make this clearer with an example. Suppose we request the owners of French ships carrying
cargoes of refined petroleum products:

Q: (Registry = "France") ∧ (Cargotype = "refined"); (? Owner)

This involves a join between SHIPS and CARGOES. The requirement that each SHIPS tuple be
joined to a restricted CARGOES tuple can be viewed as another restriction on SHIPS. However, the
subset of French ships that are carrying refined products can’t be determined prior to performing the
join as the query is stated. If SHIPS is the outer relation, there will have to be a scan of the inner
relation CARGOES for every French ship.

Now,  suppose  there  is  a  general  rule  that  only  ships  with  a  deadweight  under  60  thousand  tons 
carry  refined  products.  This  is  represented  as  the  QUIST  rule: 

R: (Cargotype = "refined") → (Deadweight < 60)

The  attribute  Deadweight  is  on  the  SHIPS  relation  and  the  attribute  Cargotype  is  on  the 
CARGOES  relation.  Rule  R  makes  it  possible  to  infer  the  constraint  (Deadweight  <  60)  “across  the 
join boundary” from the CARGOES relation to the SHIPS relation. The transformed query Q' so
obtained is:

Q': (Cargotype = "refined") ∧ (Registry = "France") ∧ (Deadweight < 60); (? Owner)

Instead  of  having  to  scan  CARGOES  for  every  French  ship,  it  is  now  only  necessary  to  scan 
CARGOES  for  every  French  ship  of  less  than  60  thousand  tons  deadweight.  This  should  bring  about 
a  substantial  reduction  in  the  cost  of  performing  the  join. 
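
The effect can be observed on a toy database. The sketch below runs a nested loops join with SHIPS as the outer relation, counts the scans of CARGOES, and compares the original query with the transformed one. The tuples, ship names, and owner names are invented, and the data are chosen to satisfy rule R (every ship carrying refined products is under 60 thousand tons).

```python
SHIPS = [
    {"Shipname": "Marianne", "Owner": "Alpha Lines", "Registry": "France", "Deadweight": 40},
    {"Shipname": "Lyon", "Owner": "Beta Marine", "Registry": "France", "Deadweight": 250},
    {"Shipname": "Brest", "Owner": "Gamma Oil", "Registry": "France", "Deadweight": 120},
    {"Shipname": "Oslo", "Owner": "Delta Shipping", "Registry": "Norway", "Deadweight": 50},
]
CARGOES = [
    {"Ship": "Marianne", "Cargotype": "refined"},
    {"Ship": "Lyon", "Cargotype": "oil"},
    {"Ship": "Brest", "Cargotype": "LNG"},
    {"Ship": "Oslo", "Cargotype": "refined"},
]


def nested_loops_join(ship_predicates):
    """Join SHIPS (outer) to CARGOES (inner); return owners and inner scan count."""
    owners, inner_scans = [], 0
    for ship in SHIPS:
        if not all(p(ship) for p in ship_predicates):
            continue
        inner_scans += 1  # one scan of CARGOES per qualifying SHIPS tuple
        for cargo in CARGOES:
            if cargo["Ship"] == ship["Shipname"] and cargo["Cargotype"] == "refined":
                owners.append(ship["Owner"])
    return owners, inner_scans


def is_french(ship):
    return ship["Registry"] == "France"


def under_60(ship):
    return ship["Deadweight"] < 60


answer_q, scans_q = nested_loops_join([is_french])                        # original Q
answer_q_prime, scans_q_prime = nested_loops_join([is_french, under_60])  # transformed Q'
```

Both queries return the same owner, but the original performs three inner scans (one per French ship) and the transformed query performs one.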

Reduction in the number of qualifying tuples is limited to the movement of constraints across the
join boundary. No reduction is achieved if the inferred constraint depends entirely on constraints on
the same relation. This is because every tuple in the relation that meets the supporting constraints must
necessarily meet the inferred constraint as well. If part of the support comes from constraints on the
other relation, though, there will be a reduction in the number of qualifiers. Again, we can make this
limitation clear with an example. Suppose the general rule stated above is altered slightly, so that it
states that every French ship that carries refined products must be under 60 thousand tons
deadweight:

R: (Cargotype = "refined") ∧ (Registry = "France") → (Deadweight < 60)

The constraint on Deadweight can still be inferred and there is still a reduction in the number of
qualifying SHIPS tuples. This is because there may be French ships whose deadweight is 60 thousand
tons or more; such ships cannot be carrying refined products, and the inferred constraint excludes them
before the join. But suppose the rule is altered again to state that all French
ships are less than 60 thousand tons deadweight, regardless of what they are carrying or of any other
factor. Then the rule is:

R: (Registry = "France") → (Deadweight < 60)

and, given query Q, the Deadweight constraint can still be obtained. This time, however, the
constraint did not move across the join boundary. No reduction in the number of qualifying SHIPS
tuples is obtained, because every tuple of SHIPS with a Registry value of "France" already has a
Deadweight value under 60.
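
A small check makes the last case concrete. In the sketch below the SHIPS tuples are invented and are chosen to satisfy the final form of the rule (every French ship is under 60 thousand tons), and the helper count_qualifying is hypothetical. Adding the inferred Deadweight constraint leaves the number of qualifying SHIPS tuples unchanged.

```python
# (Shipname, Registry, Deadweight in thousands of tons); hypothetical data
# in which every French ship is under 60 thousand tons.
SHIPS = [
    ("Marianne", "France", 40),
    ("Brest", "France", 55),
    ("Oslo", "Norway", 50),
    ("Bergen", "Norway", 200),
]


def count_qualifying(*predicates):
    """Number of SHIPS tuples that satisfy every predicate."""
    return sum(1 for ship in SHIPS if all(p(ship) for p in predicates))


def is_french(ship):
    return ship[1] == "France"


def under_60(ship):
    return ship[2] < 60


before = count_qualifying(is_french)            # constraint from the query only
after = count_qualifying(is_french, under_60)   # plus the inferred constraint
```

Both counts are 2, so the number of inner scans in a nested loops join with SHIPS as the outer relation is the same with or without the inferred constraint.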

From the discussion of constraint movement between joined relations, we conclude that

H4. Move a constraint across a join boundary. A relation involved in a join to a sufficiently
strongly restricted relation is a target for constraints.


The qualification that the joining relation be “sufficiently strongly restricted” arises for the
following reason. If a relation has strong constraints on its attributes and it is to be joined to a
relation with very weak constraints on its attributes, then it is very unlikely that a usefully strong
constraint can be inferred from the weakly constrained relation across the join boundary. If the
relation it is joined to is not itself restricted, then no constraint can be moved across the boundary.

Another qualification must be added to the preceding heuristic, related to clustering links and to
another generalization from Section 2.5:

• G6. The cost of a join decreases substantially as the strength of restrictions on the joined
relations increases, except on a relation which is clustered with respect to the join term
(and is therefore likely to be the “inner” relation of the join method).

Suppose  relation  X  is  to  be  joined  to  relation  Y  and  that  there  is  a  clustering  link  from  X  to 
Y.  Then  it  is  extremely  likely  that  a  conventional  optimizer  such  as  the  System  R  optimizer 
[Selinger79] will choose to perform the join using X as the outer relation and Y as the inner relation
in  the  manner  described  earlier.  That  is,  X  is  scanned  and  for  each  qualifying  tuple,  the  pointer  (or 
equivalent  index)  gives  the  corresponding  tuples  of  Y  quite  inexpensively.  In  this  case,  no  additional 
constraint  on  Y  can  be  applied  effectively  to  reduce  the  cost  of  the  scan,  and  there  is  no  point  in 
reducing  the  number  of  qualifying  Y  tuples  by  adding  constraints  because  Y  is  the  inner  relation  of 
the  join.  Hence,  Y  should  not  be  a  constraint  target  in  this  case,  and  we  have  the  additional  heuristic: 

H5. Don’t push a constraint down a hierarchy. A relation should not be a target for
constraints if it is joined to a restricted file from which it has a clustering link or equivalent
index.

From  our  consideration  both  of  scanning  one  relation  and  of  joining  two  relations,  we  can  suggest 
another  heuristic  as  well: 


H6.  Use  a  strongly  restricted  clustered  index.  If  a  file  is  strongly  constrained  on  an 
attribute  with  a  clustered  index,  then  it  should  not  be  a  target  for  constraints. 

This  heuristic  applies  whether  the  relation  is  the  only  one  in  the  query  or  is  joined  to  other 
relations.  In  the  former  case,  the  relation  will  be  scanned  by  way  of  the  already  constrained  attribute. 
In  the  latter  case,  the  strong  constraint  on  the  indexed  attribute  makes  the  relation  a  likely  candidate 
to  be  the  inner  relation,  hence  reducing  the  number  of  qualifiers  is  not  helpful.  Besides,  the  strength 
of  the  constraint  makes  it  unlikely  that  further  reductions  in  the  number  of  qualifiers  can  be 
obtained.

Finally, the generation of new constraints makes it possible to render some retrieval operations
unnecessary. The target in this case is a query relation that only serves to restrict another relation and
from which no information is to be output. If the restrictions on that relation can be found to be
superfluous, that is, derivable entirely from constraints on other query relations, then it can be
eliminated and the join to it eliminated at a great cost saving. We sum this up as follows:


H7.  Try  to  eliminate  a  dangling  relation.  If  a  relation  is  joined  to  just  one  other  relation  and 
none  of  its  attributes  contribute  to  the  output,  then  it  is  a  target  for  constraints. 


4.4.1.3 Summary of QUIST’s constraint generation heuristics and classes of query transformations

Let  us  summarize  the  discussion  of  QUIST  constraint  generation  heuristics  by  grouping  the 
heuristics  into  those  that  designate  constraint  targets  and  those  that  designate  nontargets.  The 
heuristics  that  designate  targets  are  shown  in  Figure  4-2.  With  each  of  these  heuristics,  we  indicate 
the  kind  of  query  transformation  it  contemplates,  in  terms  of  changes  in  scanning  or  joining 
operations. 

• H1. Try to exploit a clustered index. Try to obtain a constraint on an attribute of a relation
which  is  restricted  in  the  query  and  which  has  a  clustered  indexed  attribute  that  is  not 
restricted  in  the  query. 

o This heuristic contemplates the replacement of a segment scan by a clustering index
scan.  We  refer  to  this  transformation  as  index  introduction. 

•  H2.  Push  a  constraint  up  a  hierarchy.  A  relation  should  be  a  constraint  target  if  it  has  a 
clustering  link  into  a  much  larger  file  that  is  constrained  in  the  query,  even  if  the  relation 
itself  is  not  in  the  original  query. 

o  This  heuristic  contemplates  the  addition  of  a  join  to  the  query,  referred  to  as  join 
introduction.  The  effect  of  the  added  join  is  similar  to  replacing  a  segment  scan  of 
the  linked  relation  by  a  clustering  index  scan  of  that  relation. 

•  H4.  Move  a  constraint  across  a  join  boundary.  A  relation  involved  in  a  join  to  a 
sufficiently  strongly  restricted  relation  is  a  target  for  constraints. 

o In this case, the objective is to reduce the number of inner scans of the join by
obtaining additional restrictions prior to the cross-referencing part of the operation.
Hence,  the  transformation  is  called  scan  reduction. 

•  H7.  Try  to  eliminate  a  dangling  relation.  If  a  relation  is  joined  to  just  one  other  relation 
and  none  of  its  attributes  contribute  to  the  output,  then  it  is  a  target  for  constraints. 

o  This  heuristic  is  aimed  at  join  elimination  by  means  of  inferring  from  other  query 
constraints the constraints on the dangling relation specified in the query.

Figure 4-2: Heuristics that designate constraint targets.

In Figure 4-3, we show those constraint generation heuristics that designate relations that are not to
be targets for constraints.

• H3. Don’t introduce unlinked joins. With the exception of the clustering link
(hierarchical) case, do not generate constraints for relations that are not part of the
original query.

•  H5.  Don’t  push  a  constraint  down  a  hierarchy.  A  relation  should  not  be  a  target  for 
constraints  if  it  is  joined  to  a  restricted  file  from  which  it  has  a  clustering  link  or 
equivalent  index. 

• H6. Use a strongly restricted clustered index. If a file is strongly constrained on an
attribute  with  a  clustered  index,  then  it  should  not  be  a  target  for  constraints. 

Figure  4-3:  Heuristics  that  designate  nontargets. 

4.4.1.4  Constraint  targets  for  the  example  query 

Let  us  now  consider  the  identification  of  constraint  targets  for  the  example  query: 

Q: (Deadweight > 400) ∧ (Shiptype = "supertanker")

∧ (Dollarvalue < 1000) ∧ (Facilitytype = "offshore");

(? Destination)

The  attributes  named  in  the  query  reside  on  three  underlying  real  relations.  Attributes  are 
constrained  on  SHIPS,  CARGOES,  and  PORTS,  and  an  attribute  is  to  be  output  from  CARGOES. 

Each of these three relations is designated as a target for constraints by heuristic H4 (move a
constraint across a join boundary) because they are all involved in joins with another constrained
relation, and because neither of the exceptions in heuristics H5 (don’t push a constraint down a
hierarchy) and H6 (use a strongly restricted clustered index) applies. Both SHIPS and PORTS are also
designated as targets by heuristic H7 (try to eliminate a dangling relation) because both are joined
just to CARGOES and neither has an attribute involved in the output.

In  addition,  the  OWNERS  relation  is  designated  as  a  constraint  target  by  heuristic  H2  (push  a 
constraint  up  a  hierarchy).  The  OWNERS  file  is  much  smaller  than  the  SHIPS  file  and  there  is  a 
clustering  link  from  OWNERS  to  SHIPS. 

Finally, the POLICIES and INSURERS relations are designated as nontargets by heuristic H3
(don’t  introduce  unlinked  joins).  The  inclusion  of  either  of  these  relations  would  introduce  a  costly 
join. 
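
The designation of targets for the example query can be reproduced with a short sketch of heuristics H2 through H7. This is an illustration under simplifying assumptions, not QUIST's code: the names identify_targets, JOIN_EDGES, and CLUSTERING_LINKS are hypothetical, "sufficiently strongly restricted" is reduced to "restricted", the exceptions H5 and H6 are assumed to block every target heuristic for the relation concerned, and H1 is omitted because it designates attributes rather than relations.

```python
JOIN_EDGES = {
    frozenset(pair) for pair in [
        ("OWNERS", "SHIPS"), ("SHIPS", "CARGOES"), ("CARGOES", "PORTS"),
        ("CARGOES", "POLICIES"), ("POLICIES", "INSURERS"),
    ]
}
# (parent, child) -> True if the parent file is much smaller than the child.
CLUSTERING_LINKS = {("OWNERS", "SHIPS"): True}


def identify_targets(restricted, output, strongly_clustered=frozenset()):
    """Return ({target relation: [heuristics]}, [nontarget relations])."""
    in_query = set(restricted) | set(output)
    all_relations = set().union(*JOIN_EDGES)
    targets = {}
    for rel in sorted(in_query):
        partners = [o for o in in_query
                    if o != rel and frozenset((rel, o)) in JOIN_EDGES]
        # H5: rel is the child of a clustering link from a restricted partner.
        h5 = any((p, rel) in CLUSTERING_LINKS and p in restricted for p in partners)
        # H6: rel is already strongly constrained on a clustered index.
        h6 = rel in strongly_clustered
        if h5 or h6:
            continue
        reasons = []
        if any(p in restricted for p in partners):
            reasons.append("H4")  # joined to a restricted relation
        if len(partners) == 1 and rel not in output:
            reasons.append("H7")  # dangling relation
        if reasons:
            targets[rel] = reasons
    for (parent, child), much_smaller in sorted(CLUSTERING_LINKS.items()):
        # H2: a small parent outside the query linked into a restricted child.
        if much_smaller and parent not in in_query and child in restricted:
            targets[parent] = ["H2"]
    # H3: every other relation outside the query is a nontarget.
    nontargets = sorted(all_relations - in_query - set(targets))
    return targets, nontargets


targets, nontargets = identify_targets(
    restricted={"SHIPS", "CARGOES", "PORTS"}, output={"CARGOES"}
)
```

For the example query, targets maps SHIPS and PORTS to H4 and H7, CARGOES to H4, and OWNERS to H2, while nontargets lists INSURERS and POLICIES, in agreement with the discussion above.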

Now  that  we  have  designated  appropriate  targets  for  additional  constraints,  it  remains  to  be  seen 
how  to  use  this  information  to  guide  the  semantic  query  transformation  process.  This  issue  is  taken 
up in the next section.

4.4.2 Step 2 - Generation of new constraints

We now describe the next step of QUIST’s production of semantically equivalent queries: the
process of inferring additional constraints on database attributes. QUIST’s inference process is based
upon the methods of semantic query transformation of general relational calculus queries described
in Chapter 3 (see also Section B.4 of Appendix B). We first show how the inference process works on
the example query. Following the example, we describe QUIST’s general rules for generating new
constraints  and  for  merging  new  constraints  with  an  existing  set  of  constraints.  We  conclude  by 
noting  the  conditions  under  which  it  is  permissible  to  introduce  a  constraint  on  an  attribute 
associated  with  a  relation  not  previously  involved  in  the  query. 


4.4.2.1  Selection  of  rules  for  the  generation  of  new  constraints 

The  constraint  generation  step  begins  with  a  set  of  query  constraints  and  with  a  set  of  relations 
designated as constraint targets. A set of rules is then extracted from QUIST’s knowledge base.
These rules are used to assert new attribute constraints. To be among the rules selected for the
assertion  of  new  constraints,  a  rule  must  pass  several  tests: 

•  Relevant.  The  rule  must  be  relevant  to  the  constraints  in  the  query.  For  a  bounding  rule, 
it  is  necessary  that  one  of  its  mutually  constraining  attributes  be  constrained  in  the  query; 
we refer to this as the relevant attribute. For a production, there are two possible ways to
be  relevant:  either  the  single  attribute  constrained  on  its  right  hand  side  is  involved  in  the 
query,  or  every  attribute  constrained  on  the  left  hand  side  is  involved  in  the  query.  As 
with  a  bounding  rule,  the  term  relevant  attribute  (or  attributes)  is  used.  If  relevance  is 
achieved  by  means  of  the  right  hand  side  attribute,  then  one  more  condition  must  hold: 
there  must  be  only  one  left  hand  side  constraint.  The  reason  for  this  is  to  avoid  asserting 
a  disjunction;  this  point  is  further  discussed  in  Section  4.4.2.2. 

o For our example, rules R4, R5, and R8 (Section 4.4) are eliminated by the relevance
test. For all rules but these, one side of the rule entirely involves constraints on
Deadweight, Dollarvalue, Shiptype, or Facilitytype. Rules R4 and R8 do not even
mention  any  of  the  attributes  constrained  in  the  query.  The  right  hand  side  of  rule 
R5  constrains  Facilitytype.  Therefore,  R5  would  be  relevant  except  that  its  left 
hand  side  has  more  than  one  constraint.  Rule  R5  is  not  relevant  from  its  left  hand 
side  because  although  it  contains  a  constraint  on  Dollarvalue,  it  also  contains  a 
constraint on Cargotype, hence the rule fails the “entirely” part of the relevance
test. 

• Promising. If the rule is relevant, it is then tested to see if it is promising. This is a test
based on the expected usefulness of the constraint that can be asserted using the rule. A
bounding rule involves two attributes. One of them is the relevant attribute; the other is
the potential site of the new constraint, which we call the candidate attribute. For a
production, the attribute on the opposite side from the relevant attribute or attributes is
the candidate attribute. A rule is heuristically promising if and only if the candidate
attribute is associated with a relation in the list of constraint targets. The point of this test
is to avoid long chains of inference that have relatively little likelihood of producing a
constraint where we want one.

o The constraint targets are SHIPS, PORTS, CARGOES, and OWNERS. Among
the candidate attributes of the relevant rules, Coverage and Issuer are not associated
with  one  of  these  relations;  they  are  associated  with  POLICIES.  Thus,  rules  R3  and 
R7  fail  the  test  of  promise:  we  don’t  wish  to  bring  in  constraints  on  the  POLICIES 
relation. 

•  Applicable.  Every  relevant  and  promising  production  must  be  tested  to  see  if  it  is 
applicable.  A  production  is  applicable  if  and  only  if  each  of  its  relevant  attributes  is 
constrained  at  least  as  strongly  by  the  query  as  by  the  rule  itself.  Every  bounding  rule  is 
automatically  applicable. 

o Rules R1, R2, and R6 are still possibilities. It turns out that all these rules are
applicable. For example, the query constrains the relevant attribute Deadweight
with the constraint (Deadweight > 400). Rule R1 constrains Deadweight with the
constraint (Deadweight > 350), which is a consequence of the query constraint.

Note  that  rule  R3  would  have  passed  the  applicability  test,  but  that  rule  R7  would 
not  because  it  requires  a  stricter  constraint  on  Dollarvalue  than  the  one  specified  by 
the  query. 

Every rule that is relevant, promising, and applicable can be used to determine a new constraint on an
attribute.  The  constraint  is  considered  to  be  effective  if  the  result  of  asserting  it  in  conjunction  with 
the  corresponding  constraint  in  the  query  results  in  a  stronger  constraint  than  the  query  constraint. 
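
The three selection tests can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not QUIST’s implementation: numeric constraints are modelled as open intervals (lo, hi), string constraints as frozensets of permitted values, only the left-to-right form of relevance for productions is covered, and the two rules, their thresholds, the value “acme”, and the attribute-to-relation map are hypothetical stand-ins.

```python
from math import inf

def implies(stronger, weaker):
    """True if every value permitted by `stronger` is also permitted by `weaker`."""
    if isinstance(stronger, frozenset):
        return stronger <= weaker
    return weaker[0] <= stronger[0] and stronger[1] <= weaker[1]

def selection_tests(rule, query, relation_of, targets):
    """Return (relevant, promising, applicable) for a production used left-to-right."""
    lhs, (candidate_attr, _) = rule
    relevant = all(attr in query for attr in lhs)
    promising = relevant and relation_of[candidate_attr] in targets
    applicable = promising and all(implies(query[a], c) for a, c in lhs.items())
    return relevant, promising, applicable

# Hypothetical data loosely modelled on the running example.
query = {
    "Deadweight": (400, inf),
    "Shiptype": frozenset({"supertanker"}),
    "Dollarvalue": (-inf, 1000),
    "Facilitytype": frozenset({"offshore"}),
}
relation_of = {"Facilitytype": "PORTS", "Issuer": "POLICIES"}
targets = {"SHIPS", "PORTS", "CARGOES", "OWNERS"}

rule_a = ({"Deadweight": (350, inf)}, ("Facilitytype", frozenset({"offshore"})))
rule_b = ({"Dollarvalue": (-inf, 500)}, ("Issuer", frozenset({"acme"})))

print(selection_tests(rule_a, query, relation_of, targets))  # (True, True, True)
print(selection_tests(rule_b, query, relation_of, targets))  # (True, False, False)
```

The tests are chained in the order given above, so a rule that fails an earlier test is reported as failing the later ones as well.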

The  following  new  constraints  can  be  asserted: 

From R1: (Facilitytype = “offshore”)

From R2: (Business = “leasing”)

From R6: (Shiptype ∈ {“supertanker” “carrier”})

The first two of these constraints are effective in that they are at least as strong as the prior
constraint on the same attribute. The last constraint is not as strong as the query constraint (Shiptype
=  “supertanker”)  so  the  constraint  from  R6  is  not  effective. 

If some rules pass the three tests and some effective new constraints are obtained, then a new
round of constraint generation begins; otherwise, the constraint generation step ends. In each
succeeding round of the constraint generation step, we seek just those rules that were not applicable
in any earlier round. For instance, no applicable production is allowed to be used again. Thus, rules
R1, R2 and R6 are no longer in consideration after the first round of constraint generation.
Furthermore, attributes that have just been more tightly constrained in the last round are
distinguished from attributes that were constrained in earlier rounds; the set of relevant attributes in
the relevance test for rules must contain at least one newly constrained attribute. In this way, we
avoid the repeated retrieval of a rule whose attributes on one side are all constrained but which has
already been used to assert a constraint or has been shown to be unpromising or inapplicable. For
instance, rule R3 will not be relevant to the second round of constraint generation as Dollarvalue was
not newly constrained after the first round. If Dollarvalue receives a stronger constraint in a later
round, rule R3 will be relevant again.

We start the second round of constraint generation for the example query. Apart from the
previously described elimination of rules from consideration, the major difference in this round is
that rule R8 is now relevant (because the Business attribute was constrained in the first round).
However,  rule  R8  is  not  applicable  because  its  constraint  on  Business,  (Business  =  “petroleum”), 
does  not  follow  from  the  newly  asserted  constraint  (Business  =  “leasing”).  No  other  constraints  are 
asserted  in  this  round,  so  the  process  of  constraint  generation  terminates. 
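
The round structure can be sketched as a simple loop. The sketch below is an assumption-laden simplification rather than QUIST’s code: every constraint is a frozenset of permitted values, productions are applied left-to-right only, the test of promise is omitted, and the two rules and the attribute Region are hypothetical. It does reproduce the behaviour described above, in which the second round finds a newly relevant rule that turns out not to be applicable.

```python
def generate_constraints(query, rules):
    """Run rounds of constraint generation until no attribute is newly tightened.

    query: dict mapping attribute -> frozenset of permitted values.
    rules: list of (lhs, (attr, permitted)) productions, applied left-to-right.
    """
    constraints = dict(query)
    newly = set(constraints)   # in round 1 every query attribute counts as new
    used = set()               # rules already found applicable are never reused
    rounds = 0
    while newly:
        rounds += 1
        snapshot = dict(constraints)
        tightened = set()
        for i, (lhs, (attr, permitted)) in enumerate(rules):
            if i in used or not (set(lhs) & newly):
                continue       # relevance needs a newly constrained attribute
            if not all(a in snapshot and snapshot[a] <= c for a, c in lhs.items()):
                continue       # not relevant, or not applicable
            used.add(i)
            prior = constraints.get(attr)
            merged = permitted if prior is None else prior & permitted
            if merged != prior:            # effective: the attribute is tightened
                constraints[attr] = merged
                tightened.add(attr)
        newly = tightened
    return constraints, rounds

query = {"Shiptype": frozenset({"supertanker"})}
rules = [
    ({"Shiptype": frozenset({"supertanker"})}, ("Business", frozenset({"leasing"}))),
    ({"Business": frozenset({"petroleum"})}, ("Region", frozenset({"gulf"}))),
]
constraints, rounds = generate_constraints(query, rules)
print(rounds)               # 2
print(sorted(constraints))  # ['Business', 'Shiptype']
```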

4.4.2.2 Semantic equivalence transformations in QUIST

We  now  describe  the  general  conditions  related  to  the  generation  of  constraints  in  QUIST  as  the 
basis  for  semantic  equivalence  transformations  of  QUIST  queries.  The  discussion  here  appeals  to  an 
intuitive  notion  of  inference  for  the  particular  kinds  of  expressions  and  rules  admitted  by  QUIST.  In 
Appendix B, we show how semantic equivalence transformations in QUIST are actually a special case
of such transformations for relational queries, and therefore how the formal definitions advanced in
Chapter 3 apply to QUIST as well.

As  noted  earlier,  there  are  two  kinds  of  rules  in  QUIST:  bounding  rules  and  productions.  Once  a 
rule  has  been  selected  to  try  to  generate  a  new  constraint,  the  ensuing  manipulations  are  domain- 
independent;  that  is,  they  depend  only  upon  properties  of  mathematical  and  set  operators.  It  should 
be  noted  too  that  the  result  of  any  QUIST  inference  is  the  assertion  of  a  single  constraint  on  a  single 
attribute. 

The  conjunctive  form  of  query  permitted  in  QUIST  lends  itself  to  a  simple  form  of  semantic 
equivalence  transformation.  The  restriction  portion  of  a  QUIST  query  Q  can  be  represented  as 

Q: C1(A1) ∧ C2(A2) ∧ ... ∧ Cn(An).

Query Q involves constraints on the set of attributes {A1, A2, ..., An}, a subset of all the attributes in the
virtual relation. Each term Ci(Ai) is one of the constraint forms defined earlier.

Given a conjunctive query Q, the basic semantic transformation operation of QUIST is as follows:

1.  Select  some  semantic  rule  R  from  the  QUIST  semantic  knowledge  base. 

2.  If  possible,  use  rule  R  and  query  Q  to  produce  a  new  constraint  C'(A)  on  attribute  A. 

3. If a new constraint C'(A) is produced, combine it with Q to form the transformed query Q'.

We have already discussed QUIST’s rule selection tests and the heuristics that they use. Here we
assume that a rule R has been selected and we discuss how a new constraint C'(A) can be produced.
In Section 4.4.2.3, we discuss how the new constraint can be merged with query Q to form the
semantically equivalent transformed query Q'.

We examine this first in the context of a bounding rule. A QUIST bounding rule is of the form:

A1 θ A2

where A1 and A2 are numerical-valued database attributes and θ is a standard Boolean comparison
operator such as less-than or greater-than. The bounding rule places an upper bound on one of the
attributes and a corresponding lower bound on the other (except in cases of equality and inequality).
If  a  query  constrains  one  of  the  attributes  by,  say,  placing  an  upper  bound  on  it,  and  if  the  bounding 
rule  indicates  that  the  constrained  attribute  serves  as  an  upper  bound  for  the  other  attribute,  then  that 
other  attribute  inherits  the  same  upper  bound.  Similar  remarks  hold  for  a  lower  bound. 

As  an  example,  consider  bounding  rule  R4  from  the  example  knowledge  base.  It  states  that  a  ship 
carries  no  more  cargo  than  its  rated  capacity,  and  is  represented  to  QUIST  as  (Quantity  <  Capacity). 
It  is  natural  to  think  of  the  value  of  Capacity  as  providing  an  upper  bound  on  the  value  of  Quantity; 
it  is  equally  correct  to  think  of  the  value  of  Quantity  providing  a  lower  bound  on  the  value  of 
Capacity.  Suppose  a  query  contains  the  constraint  (Quantity  >  100);  that  is,  the  query  places  a  lower 
bound  on  Quantity.  Then  it  is  easy  to  see  that  a  lower  bound  constraint  on  Capacity,  (Capacity  > 
100), can be inferred. In a similar way, a query with a constraint (Capacity < 250) permits the
inference  of  the  constraint  (Quantity  <  250).  If  the  query  instead  contains  the  constraint  (Quantity  < 
100),  then  it  is  not  possible  to  use  the  example  rule  to  infer  anything  about  Capacity,  and  similarly  for 
the  constraint  (Capacity  >  300). 
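
The four cases of this example can be checked with a small sketch. It assumes a bounding rule of the strict less-than form only, and it models a query constraint as an open interval (lo, hi); the function name and the tuple encoding are illustrative choices, not part of QUIST.

```python
from math import inf

def bound_inference(rule, query_attr, interval):
    """Infer a constraint from a bounding rule of the form low_attr < high_attr.

    rule: (low_attr, high_attr); interval: open interval (lo, hi) on query_attr.
    Returns (attribute, interval) for the inferred constraint, or None.
    """
    low_attr, high_attr = rule
    lo, hi = interval
    if query_attr == low_attr and lo > -inf:
        return high_attr, (lo, inf)    # a lower bound propagates to the larger attribute
    if query_attr == high_attr and hi < inf:
        return low_attr, (-inf, hi)    # an upper bound propagates to the smaller attribute
    return None

r4 = ("Quantity", "Capacity")                          # (Quantity < Capacity)
print(bound_inference(r4, "Quantity", (100, inf)))     # ('Capacity', (100, inf))
print(bound_inference(r4, "Capacity", (-inf, 250)))    # ('Quantity', (-inf, 250))
print(bound_inference(r4, "Quantity", (-inf, 100)))    # None
print(bound_inference(r4, "Capacity", (300, inf)))     # None
```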

Turning  now  to  productions,  we  will  see  that  constraint  generation  draws  on  properties  of  both 
numerical  and  set  operators.  As  noted  earlier,  a  QUIST  production  is  a  rule  of  the  form 

C1(A1) ∧ ... ∧ Cn(An) → C'(A)

where each term Ci(Ai) signifies a constraint on some database attribute and where a given attribute
appears  at  most  once  on  the  left  hand  side. 

With QUIST productions, it is possible to reason left-to-right or right-to-left. In reasoning
left-to-right, it is necessary to show that the query constrains all the attributes on the left hand side
of the rule, and that every such rule constraint follows from the corresponding query constraint by the
properties of numerical or set comparison.† If so, then the rule’s right hand side constraint can be
asserted. 

Reasoning right-to-left is limited to productions with a single constraint on the left hand side. This
is because right-to-left reasoning deals with the contrapositive of the rule. The negation of a
multiterm conjunction is a multiterm disjunction, but QUIST, in common with many other inference
systems, makes no inferences with disjunctions of terms. In this mode of reasoning, if it is seen that
the negation of the right hand term follows from the corresponding query constraint, then the
negation of the left hand term can be asserted. Obtaining the negation of a constraint on a
string-valued attribute is simply a matter of exchanging ∈ for ∉, or vice versa. For constraints on
numerical-valued attributes, it is a matter of “inverting” the interval specified in the constraint. For
instance, the negation of the constraint (Age ∈ ((18 65])) is (Age ∈ ((-∞ 18] (65 ∞))).
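
Both kinds of negation are mechanical, as the following sketch suggests. The tuple encodings (an interval as (lo, hi, lo_closed, hi_closed), a membership constraint as an operator string with a value set) are assumptions made for the illustration.

```python
from math import inf

def negate_interval(lo, hi, lo_closed, hi_closed):
    """Complement of one interval, returned as a list of
    (lo, hi, lo_closed, hi_closed) intervals."""
    pieces = []
    if lo > -inf:
        pieces.append((-inf, lo, False, not lo_closed))
    if hi < inf:
        pieces.append((hi, inf, not hi_closed, False))
    return pieces

def negate_membership(op, values):
    """Swap 'in' and 'not in' for a constraint on a string-valued attribute."""
    return ("not in" if op == "in" else "in", values)

# (18 65] becomes (-inf 18] together with (65 inf)
print(negate_interval(18, 65, False, True))
# [(-inf, 18, False, True), (65, inf, False, False)]
print(negate_membership("in", frozenset({"juice"})))
```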


† In particular, QUIST does not set up inference subgoals.

To illustrate inference with QUIST productions, consider a production that states that juice and
bananas  are  always  carried  in  refrigerated  ships: 

(Cargotype ∈ {“juice” “bananas”}) → (Shiptype = “refrigerated”).

Suppose the query contains the restriction (Cargotype = “bananas”). The rule’s constraint on
Cargotype follows from this corresponding query constraint because the set of values permitted in the
query is a subset of the values permitted by the rule. Therefore, the constraint (Shiptype =
“refrigerated”) can be asserted. If, on the other hand, the query contains the constraint (Shiptype =
“supertanker”), then the production given above can be used to assert the constraint

(Cargotype ∉ {“juice” “bananas”}).
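
Both directions of reasoning with this production can be sketched as follows. The sketch assumes a production with a single left hand side term over string-valued attributes, and it encodes a constraint as an operator (“in” or “not in”) paired with a set of values; these encodings and function names are illustrative, not QUIST’s.

```python
def follows(query_c, target_c):
    """True if the target constraint is a consequence of the query constraint."""
    (q_op, q_vals), (t_op, t_vals) = query_c, target_c
    if q_op == "in" and t_op == "in":
        return q_vals <= t_vals
    if q_op == "in" and t_op == "not in":
        return not (q_vals & t_vals)
    if q_op == "not in" and t_op == "not in":
        return t_vals <= q_vals
    return False  # an exclusion alone does not establish membership here

def negate(c):
    op, vals = c
    return ("not in" if op == "in" else "in", vals)

def infer(rule, query):
    """Try left-to-right first, then right-to-left via the contrapositive."""
    (l_attr, l_c), (r_attr, r_c) = rule
    if l_attr in query and follows(query[l_attr], l_c):
        return r_attr, r_c
    if r_attr in query and follows(query[r_attr], negate(r_c)):
        return l_attr, negate(l_c)
    return None

rule = (("Cargotype", ("in", frozenset({"juice", "bananas"}))),
        ("Shiptype", ("in", frozenset({"refrigerated"}))))

print(infer(rule, {"Cargotype": ("in", frozenset({"bananas"}))}))
print(infer(rule, {"Shiptype": ("in", frozenset({"supertanker"}))}))
```

The first call asserts the constraint on Shiptype; the second asserts the exclusion on Cargotype, matching the two cases described above.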

4.4.2.3  Merging  a  new  constraint  with  an  existing  query 

We  have  seen  how  QUIST  generates  additional  constraints  on  attributes.  In  this  section,  we 
describe in general how new constraints are combined with an existing QUIST query.

The result of any one of the inference processes just described is the assertion of a single new
constraint C' on an attribute A. The processes can be viewed as follows: given some conjunction T of
terms Ci(Ai), it is possible to infer a new term C'(A); that is, T → C'. From this point of view,
combining  the  new  constraint  with  the  old  ones  follows  along  the  lines  described  in  Section  3.6  on 
logical  transformations  in  semantic  query  optimization,  with  some  additional  factors  arising  from 
QUIST’s  joinless  representation  and  from  the  task  of  detecting  unsatisfiable  query  constraints. 

To  be  more  specific,  let  query  Q  be  represented  as  before: 

Q: C1(A1) ∧ C2(A2) ∧ ... ∧ Cn(An).

By the logical equivalence

(A ∧ (A → B)) ≡ (A ∧ B)

query Q can be transformed† into the semantically equivalent query:

Q': Q ∧ C'(A)

† In most cases; but see Section 4.4.3.1.

The new query Q' is actually formed by replacing the prior constraint C(A) on attribute A by the
conjunction of C(A) and the new constraint C'(A). The resultant constraint is obtained as follows:

1. If there is no prior constraint C(A), then the resultant constraint is merely C'(A).

2. If the prior constraint C(A) is stronger than C'(A), then the resultant constraint remains
C(A).

3. If the new constraint C'(A) is as strong as or stronger than C(A), then the resultant
constraint is C'(A).

4. If C(A) and C'(A) overlap in the sense that C permits some values of A that C' does not
permit and vice versa, then the resultant constraint is the intersection of the values they
permit. For instance, if C(A) is (A ∈ {“a” “b”}) and C'(A) is (A ∈ {“b” “c”}), then the
resultant constraint is (A = “b”). An analogous combining rule is observed for numerical
interval  constraints. 

5.  Finally,  if  C(A)  and  C'(A)  conflict  in  the  sense  that  there  are  no  values  of  A  that  can 
satisfy both constraints, then the original query restrictions are not satisfiable in the
database; that is, the answer to the original query must be the empty set. Note that this is
detected  without  recourse  to  the  data. 
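
The five merging cases can be sketched for set-valued constraints as follows. The sketch assumes each constraint is a frozenset of permitted values (None when there is no prior constraint) and returns the case number alongside the result; the function name and return convention are ours.

```python
def merge(prior, new):
    """Combine a prior constraint with a new one on the same attribute.

    Returns (resultant, case), where case is the number of the merging
    rule that applied; an empty resultant in case 5 means the query
    restrictions cannot be satisfied.
    """
    if prior is None:
        return new, 1
    if new <= prior:          # new is as strong as or stronger than prior
        return new, 3
    if prior < new:           # prior is strictly stronger
        return prior, 2
    both = prior & new
    if both:                  # overlap: keep the common values
        return both, 4
    return frozenset(), 5     # conflict: no value satisfies both

print(merge(frozenset({"a", "b"}), frozenset({"b", "c"})))  # (frozenset({'b'}), 4)
print(merge(frozenset({"a"}), frozenset({"b"})))            # (frozenset(), 5)
```

Case 5 is detected from the constraints alone, without any access to stored data.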


4.4.3  Step  3  -  Formulation  of  the  set  of  semantically  equivalent  queries 

We now discuss the last step of QUIST’s generation phase: the formulation of a set of alternative,
semantically equivalent queries from the constraints generated in the preceding step. A simplified
way to look at the final step of query formulation is as follows. After constraint generation, there is a
set of constraints on database attributes. Some of these constraints must be part of the query while
other constraints are optional. A constraint is optional if it can be derived from other query
constraints. One of the queries in the set of semantically equivalent queries is a “kernel” query, Q0,
that includes only the necessary constraints. If no new constraints are generated, then Q0 is the
original query. If there are N additional optional constraints, then the set of equivalent queries
includes an additional 2^N - 1 queries generated by all possible choices of including or excluding those
N constraints.

This  account  must  be  modified  in  several  ways.  First,  it  is  not  always  possible  to  classify  every 
constraint  as  necessary  or  optional  independently  of  the  classification  of  the  other  constraints.  What 
may happen is that two sets of constraints are related, so that one set or the other may be excluded,
but not both. Second, the addition to the query of certain derivable constraints may implicitly
introduce new relations into the query. Introduction of new relations is only permitted if the
database meets certain structural constraints. Finally, QUIST assumes that once a particular (real)
relation is involved in a query, every constraint on attributes of that relation should be part of the
query. The reason is that additional constraints on a relation cannot increase the cost of processing,
given QUIST’s cost measure, the number of page fetches from secondary storage. Therefore, the
number  of  independently  excludable  constraints  is  reduced. 

In the remainder of this section, we indicate how the set of equivalent queries is formulated for our
example.  After  that,  we  give  the  details  about  when  constraints  can  be  considered  optional,  and  when 
new  relations  can  be  introduced. 


We started with the QUIST query

Q: (Deadweight > 400) ∧ (Shiptype = “supertanker”)
∧ (Dollarvalue < 1000) ∧ (Facilitytype = “offshore”);
(? Destination)

and the constraint generation step left us with the following constraints:

(Deadweight  >  400),  (Shiptype  =  “supertanker”),  (Dollarvalue  <  1000), 

(Facilitytype  =  “offshore”),  (Business  =  “leasing”). 

The only new constraint derived in that step is (Business = “leasing”), although it is now known
that  the  Facilitytype  constraint  is  derivable  from  other  query  constraints,  namely  from  the  constraint 
on  Deadweight  using  rule  R2. 

Let  us  denote  the  five  constraints  by  C1  through  C5  as  follows: 

C1: (Deadweight > 400)

C2:  (Shiptype  =  “supertanker”) 

C3:  (Dollarvalue  <  1000) 

C4:  (Facilitytype  =  “offshore”) 

C5:  (Business  =  “leasing”) 

Then C1 through C3 are the necessary constraints and C4 and C5 are the optional constraints. The
kernel query Q0 contains just the necessary constraints:

Q0 = C1 ∧ C2 ∧ C3

which, incidentally, is not the original query because the constraint on Facilitytype has been
identified as optional. There are two optional constraints, each on a separate underlying relation, so
there are three other equivalent queries:

Q1 = Q0 ∧ C4 (the original query)

Q2 = Q0 ∧ C5

Q3 = Q0 ∧ C4 ∧ C5

The  cost  of  the  alternative  queries  can  be  estimated  by  determining  the  real  relations  they  involve. 
The  kernel  query  Q0  involves  attributes  on  SHIPS  and  CARGOES;  that  is,  it  is  possible  not  to 
involve  PORTS  at  all,  because  PORTS  is  not  involved  in  the  output  and  the  only  constraint  on  one  of 
its  attributes  is  derivable  from  constraints  on  other  relations.  All  the  other  queries  involve  SHIPS  and 
CARGOES, while bringing in additional relations: Q1 adds PORTS (Q1 corresponds to the original
query), Q2 adds OWNERS, and Q3 brings in both PORTS and OWNERS.
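
The enumeration of the kernel query and its 2^N - 1 companions, together with the real relations each one involves, can be sketched as below. The sketch covers only the simplified account, in which every optional constraint may be included or excluded independently. The attribute-to-relation map is an assumption inferred from the example (in particular, placing Dollarvalue and Destination on CARGOES).

```python
from itertools import combinations

relation_of = {  # assumed assignment of attributes to real relations
    "Deadweight": "SHIPS", "Shiptype": "SHIPS", "Dollarvalue": "CARGOES",
    "Facilitytype": "PORTS", "Business": "OWNERS", "Destination": "CARGOES",
}
necessary = ["Deadweight", "Shiptype", "Dollarvalue"]   # attributes of C1, C2, C3
optional = ["Facilitytype", "Business"]                 # attributes of C4, C5
output = "Destination"

def equivalent_queries(necessary, optional):
    """Kernel query first, then every non-empty choice of optional constraints."""
    return [necessary + list(extra)
            for k in range(len(optional) + 1)
            for extra in combinations(optional, k)]

def relations_involved(attrs, output):
    """Real relations touched by the constrained attributes and the output."""
    return sorted({relation_of[a] for a in attrs} | {relation_of[output]})

for q in equivalent_queries(necessary, optional):
    print(relations_involved(q, output))
```

The four lines printed correspond to Q0, Q1, Q2, and Q3 in that order.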

The  OWNERS  relation  is  of  course  not  involved  in  the  original  query.  In  order  for  queries  Q2  and 
Q3 to be equivalent to the original query, it is necessary that every tuple in SHIPS have a
corresponding  tuple  in  OWNERS.  If  this  condition  is  not  met,  then  it  is  possible  that  some  tuples  in 
SHIPS that satisfy the original query conditions will not satisfy the join condition.

QUIST has now generated four equivalent queries, each involving a different set of database
relations. This concludes the first phase of the QUIST system. The next phase is to determine which
of these queries has the lowest estimated processing cost. Before we discuss this, we discuss the
conditions under which relations can be added to a query (as OWNERS is added to get queries Q2
and Q3), and the conditions under which relations can be dropped from a query (as PORTS is
dropped from the original query to get queries Q0 and Q2).


4.4.3.1  The  introduction  of  joins 

It was noted earlier how the query Q ∧ C'(A) is semantically equivalent to query Q in most cases,
if  constraint  C'(A)  can  be  derived  from  the  other  constraints  in  Q.  The  possible  exception  arises  when 
the  addition  of  C'(A)  implicitly  introduces  a  new  (real)  relation  into  the  query,  the  relation  R  to 
which  attribute  A  is  associated.  The  introduction  of  a  new  relation  also  introduces  one  or  more  joins 
to  connect  R  with  the  real  relations  already  involved  in  Q. 

This  section  discusses  the  conditions  under  which  it  is  permissible  to  introduce  new  relations  and 
new joins into a query. Briefly, this is allowed only if no tuples in the original query fail to
satisfy the join terms that must be introduced. We illustrate this idea with an example here. It is
discussed more completely in Appendix B.

Suppose the query requests the destination of all
cargoes  of  refined  petroleum  products: 

Q:  (Cargotype  =  “refined”);  (?  Destination) 

and  suppose  it  is  known  that  refined  petroleum  products  are  only  carried  by  ships  whose  deadweight 
does  not  exceed  60  thousand  tons: 

R: (Cargotype = “refined”) → (Deadweight ≤ 60).

Cargotype is associated with the CARGOES relation, and Deadweight is associated with the SHIPS
relation, which is not involved in the original query Q. The straightforward query transformation
produces a request for the destination of all cargoes of refined petroleum products that are being
carried by ships of at most 60 thousand tons deadweight:

Q': (Cargotype = “refined”) ∧ (Deadweight ≤ 60); (? Destination)

The new query Q' implicitly introduces a join between CARGOES and SHIPS. One way to process
Q' is to find all ships not exceeding 60 thousand tons deadweight, then to find the cargoes they are
carrying and indicate the destinations for the cargoes that are refined petroleum products. However,
consider  some  tuple  x  in  the  CARGOES  relation.  If  the  Ship  attribute  of  x  has  a  null  value,  or  if  it 
contains  the  name  of  a  ship  that  is  not  listed  among  the  tuples  of  the  SHIPS  relation,  then  the  join 
will  miss  tuple  x,  even  though  a  simple  scan  of  CARGOES  requested  by  the  original  query  Q  will 
return  tuple  x. 

The difficulty is related to the structural semantics [ElMasri80b] of the database. The fact that
there is no value for the Ship attribute of a tuple in CARGOES may be a data entry oversight, or it
may reflect a database design decision to permit null values and thus to interpret a cargo as existing
independently of any ship that may carry it. The fact that the latter interpretation results in a null
value in some field is an artifact of the manner in which relationships can be represented in the
relational model. The other case, in which the value in the Ship attribute of some tuple in
CARGOES does not correspond to the value in the Shipname attribute of any tuple in SHIPS, is
much more likely to be an error in data entry.

In any event, query Q' is semantically equivalent to query Q if and only if for every tuple in
CARGOES there exists a corresponding tuple in SHIPS, where “corresponding” refers to tuples in
SHIPS that would be logically accessed from CARGOES by way of the logical access path defined
for QUIST’s virtual relation.

It  is  assumed  throughout  the  QUIST  system  that  the  appropriate  structural  constraints  on  the 
existence of corresponding tuples are enforced, so that the introduction of joins is always permitted.†
The system could be modified to make this assumption unnecessary. It would be necessary to
incorporate another test to see if the existence condition does in fact hold when the introduction of a
particular  join  is  considered. 
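
A test of the existence condition might look like the following sketch, which also shows how Q and Q' diverge when the condition fails. Relations are modelled as lists of dictionaries, and the sample tuples (ship name, destinations) are invented for the illustration; none of this is part of QUIST.

```python
def join_is_safe(cargoes, ships):
    """True if every CARGOES tuple has a corresponding SHIPS tuple."""
    ship_names = {s["Shipname"] for s in ships}
    return all(c.get("Ship") in ship_names for c in cargoes)

def q(cargoes):
    """Original query: destinations of refined cargoes, by scanning CARGOES."""
    return [c["Destination"] for c in cargoes if c["Cargotype"] == "refined"]

def q_prime(cargoes, ships):
    """Transformed query: adds (Deadweight <= 60), which forces a join with SHIPS."""
    light = {s["Shipname"] for s in ships if s["Deadweight"] <= 60}
    return [c["Destination"] for c in cargoes
            if c["Cargotype"] == "refined" and c.get("Ship") in light]

ships = [{"Shipname": "Alpha", "Deadweight": 45}]
cargoes = [
    {"Cargotype": "refined", "Ship": "Alpha", "Destination": "Oslo"},
    {"Cargotype": "refined", "Ship": None, "Destination": "Lisbon"},
]
print(join_is_safe(cargoes, ships))  # False
print(q(cargoes))                    # ['Oslo', 'Lisbon']
print(q_prime(cargoes, ships))       # ['Oslo']
```

The cargo with a null Ship value is returned by Q but missed by Q', so the two queries are not equivalent in this database state.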

4.4.3.2  The  elimination  of  query  constraints 

As  pointed  out  in  Section  3.6,  it  is  possible  not  only  to  add  constraints  to  a  query,  but  also  to 
eliminate  constraints  if  they  are  derivable  from  other  constraints  in  the  query.  A  constraint  on  a 
single  attribute  can  be  eliminated  despite  the  fact  that  it  was  constrained  in  the  original  query, 
provided  that  another  equally  strong  or  stronger  constraint  can  be  derived  on  the  same  attribute 
based  solely  on  initial  constraints  on  other  attributes.  Similar  conditions  hold  for  the  elimination  of 
constraints  on  more  than  one  originally  constrained  attribute,  although  care  must  be  taken  to  avoid 
eliminating  sets  of  constraints  that  support  each  other's  derivation. 

The following example illustrates the possible pitfall in constraint elimination. Let query Q
contain the constraints (A1 > 10) ∧ (A2 > 30) ∧ (A3 > 50). Assume there are two production rules, R1
and R2:

R1: (A1 > 5) ∧ (A3 > 10) → (A2 > 40)

R2: (A2 > 25) ∧ (A3 > 40) → (A1 > 20)

With rule R1, it is possible to infer the new constraint (A2 > 40), with constraints on A1 and A3 in
its basis. With rule R2, we obtain (A1 > 20), whose basis includes constraints on A2 and A3. Hence,
both A1 and A2 are candidates for constraint elimination. Yet if both are dropped, yielding the query
(A3 > 50), then there is no guarantee that the items retrieved by that query satisfy the constraints on
either A1 or A2. The problem is that the derived constraint on either attribute requires a constraint
on the other one.

† This condition is made more precise in Appendix

The details of the analysis of constraint elimination in QUIST are as follows. Suppose that query
Q' has been formed from the original query Q through several steps of inference and merging, and
that Q' constrains attributes A1 through An. For every attribute A in this set, one of three conditions
holds: 

1. Attribute A was not constrained by the original query Q. Clearly, then, the constraint C'(A)
in Q' is not essential for obtaining the desired answer and can be eliminated.

2.  Attribute  A  was  constrained  by  the  original  query  Q  and  no  other,  stronger  constraints 
have been derived on it. Therefore, constraint C'(A) in Q' is essential and must not be
eliminated. 

3.  Attribute  A  was  constrained  by  the  original  query  Q  but  other,  stronger  constraints  have 
been  obtained  on  A  during  the  inference  and  merging  that  produced  the  constraint  C'(A) 
in Q'.

Thus,  constraints  on  all  attributes  in  class  1  can  be  eliminated,  but  no  constraints  on  attributes  in  class 
2  can  be  eliminated.  Whether  or  not  constraints  on  class  3  attributes  can  be  eliminated  depends  upon 
what is called the basis of the constraints. This explicit maintenance and use of inference
dependencies to reason about the necessity of constraints is akin to the set-of-support ideas for
derived information used for “truth maintenance” systems ([Fikes75], [Doyle78]).

The  basis  of  a  constraint  C  on  some  database  attribute  A  is  defined  to  be  the  set  of  constraints  in 
the  original  query  which  must  hold  in  order  for  C  to  hold.  Before  any  steps  of  inference  and 
merging,  the  basis  of  each  constraint  imposed  in  the  initial  query  contains  just  the  constraint  itself. 

Let Q' be the current query, and let {C1(A1), ..., Ck(Ak)} be the set of constraints in Q' that enable
constraint C'(A) to be asserted using some semantic rule R. The basis of C'(A) is the union of the
bases of those constraints. C'(A) is now merged with Q' according to the rules listed in the preceding
section. When C'(A) is strictly stronger or weaker than the prior constraint C(A), the basis of the
resultant constraint is simply the basis of the stronger constraint. When the two constraints overlap,
the basis of the resultant constraint is the union of the bases of the new and the prior constraints.
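The bookkeeping just described can be sketched in a few lines of Python. This is only an illustrative sketch, not QUIST's Interlisp code: it assumes every constraint is an open numeric interval on a single attribute, and it identifies each original-query constraint by its attribute name (reasonable only because the original query holds at most one constraint per attribute). The names `Constraint`, `given`, `infer`, and `merge` are hypothetical, as is the choice to keep the prior basis when two constraints are identical.

```python
from dataclasses import dataclass
from math import inf


@dataclass(frozen=True)
class Constraint:
    attribute: str
    low: float          # the constraint reads: low < attribute < high
    high: float
    basis: frozenset    # attributes whose original-query constraints support it


def given(attribute, low=-inf, high=inf):
    """A constraint from the original query: its basis is just itself."""
    return Constraint(attribute, low, high, frozenset({attribute}))


def infer(attribute, low, high, antecedents):
    """A constraint asserted by a rule: its basis is the union of the
    bases of the constraints that enabled the rule."""
    basis = frozenset().union(*(c.basis for c in antecedents))
    return Constraint(attribute, low, high, basis)


def merge(prior, new):
    """Merge a new constraint with the prior one on the same attribute."""
    assert prior.attribute == new.attribute
    low, high = max(prior.low, new.low), min(prior.high, new.high)
    if low >= high:
        raise ValueError("constraints are contradictory (null query)")
    if (low, high) == (prior.low, prior.high):
        basis = prior.basis                  # prior is the stronger one
    elif (low, high) == (new.low, new.high):
        basis = new.basis                    # new is the stronger one
    else:
        basis = prior.basis | new.basis      # overlap: union of bases
    return Constraint(prior.attribute, low, high, basis)
```

For instance, a constraint on A1 inferred from given constraints on A2 and A3 carries the basis {A2, A3}, and it keeps that basis when merged with a weaker given constraint on A1.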

To return to the question of eliminating constraints, first consider individually each class 3
attribute, that is, each attribute that is constrained in the original query Q and upon which additional
constraints have been obtained. Let C(A) be the constraint on A in the initial query Q, and let C'(A)
be the constraint on A in transformed query Q'. Constraint C'(A) in transformed query Q' can be
eliminated if and only if constraint C(A) from original query Q is not in the basis of C'(A); in other
words, only if C'(A) is derivable entirely from query constraints on attributes other than A.

When considering the elimination of constraints from several attributes that are constrained in the
original query, it is necessary to avoid eliminating too many constraints, as the earlier example
illustrated. That situation is avoided by retaining the rule against eliminating a constraint on an
attribute that appears in its own basis, and employing a procedure to keep track of the ultimate basis
of a constraint when other constraints are eliminated. Suppose there are several candidates for
constraint elimination. Let the constraint on attribute A meet the test for single constraint
elimination. Suppose that constraints on attributes B and C are in the basis of the constraint on
A. When the constraint on A is eliminated, it should be dropped from the bases of all other
constraints and replaced by the constraints on B and C (unless they already appear).

In the example, suppose we choose to eliminate the constraint on attribute A1. Its basis contains
constraints on A2 and A3. The basis of the constraint on A2 does contain the constraint on A1, so
the constraint on A1 is replaced there by the constraints on A2 and A3. Since A3 already appears in
the basis, the new basis consists of constraints on A2 and A3. But now, A2 no longer meets the test
for constraint elimination, because it appears in its own basis. Thus, we are left with the equivalent
query (A2 > 40) ∧ (A3 > 50). It is easy to see by production rule R2 that items that satisfy this query
also satisfy the original constraint (A1 > 10).
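The elimination test and the basis-substitution procedure can be sketched as follows. This is a minimal sketch under stated assumptions: bases are represented as sets of attribute names, and the starting bases used in the usage note are an assumed reading of the example (the constraint on A1 supported by A2 and A3, the constraint on A2 supported by A1 and A3, the constraint on A3 supported only by itself). The function names are hypothetical.

```python
def can_eliminate(attribute, bases):
    """Single-constraint test: the constraint on `attribute` may be dropped
    only if the attribute does not appear in its own basis."""
    return attribute not in bases[attribute]


def eliminate(attribute, bases):
    """Drop the constraint on `attribute` and return the updated bases.

    Wherever the eliminated constraint appeared in another basis, it is
    replaced by the members of its own basis (set union takes care of
    members that already appear).
    """
    if not can_eliminate(attribute, bases):
        raise ValueError(f"constraint on {attribute} is essential")
    support = bases[attribute]
    updated = {}
    for other, basis in bases.items():
        if other == attribute:
            continue
        if attribute in basis:
            basis = (basis - {attribute}) | support
        updated[other] = frozenset(basis)
    return updated


if __name__ == "__main__":
    bases = {
        "A1": frozenset({"A2", "A3"}),
        "A2": frozenset({"A1", "A3"}),
        "A3": frozenset({"A3"}),
    }
    after = eliminate("A1", bases)
    print(sorted(after["A2"]))            # basis of A2 after substitution
    print(can_eliminate("A2", after))     # A2 now appears in its own basis
```

With these assumed bases, eliminating A1 leaves A2 with the basis {A2, A3}, so A2 fails the test afterwards and the constraints on A2 and A3 are both retained.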


4.4.4 Step 4 - Determining the lowest cost query

The  last  task  of  the  QUIST system  is  to  take  the  set  of  queries  produced  in  the  preceding  steps  and 
to  estimate  which  one  costs  the  least  to  carry  out.  In  a  sense,  this  process  is  not  an  integral  part  of 
semantic  query  optimization  because  it  is  merely  a  matter  of  performing  a  conventional  query 
optimization  analysis  for  a  set  of  queries,  rather  than  for  a  single  query. 

QUIST’s query cost estimator is derived from the one described for System R [Selinger79]. The
assumptions behind the System R query optimization were reviewed in Chapter 2. QUIST’s cost
estimator differs from that of System R chiefly in assuming that a join is carried out by the nested
loops method rather than choosing between that method and the merging scans method. This is a
reasonable simplification and does not affect the heuristics; for instance, it would still make sense to
move constraints across join boundaries prior to performing sorts. Another difference from the
System R optimizer is that QUIST’s cost measure involves just the number of estimated page fetches
rather than combining this with an estimate of CPU activity. However, results reported in
[Astrahan80a] suggest that the number of page fetches is a suitable cost measure for the class of
queries admitted by QUIST.

QUIST’s estimator differs from System R’s in one other respect: the estimation of restriction
selectivities (Section 4.4.1.1). System R assumes that all constraints are independent, hence the
estimated selectivity of a conjunction of constraints is the product of the estimated selectivity of each
constraint alone. On the other hand, the QUIST estimator must distinguish between given and
derived constraints. A derived constraint is, of course, not independent from the constraints from
which it is derived. In QUIST, the estimated selectivity of the conjunction of a given constraint and a
constraint derived from it is taken to be the estimated selectivity of the given constraint alone.
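The difference between the two estimates can be illustrated with a short sketch. The representation of a constraint as a (description, selectivity, is_derived) triple and the selectivity values are assumptions made for illustration only. The sketch also assumes that every derived constraint in the list was derived from given constraints in the same list.

```python
from math import prod


def independent_selectivity(constraints):
    """System R style: treat all constraints as independent and multiply."""
    return prod(selectivity for _, selectivity, _ in constraints)


def given_only_selectivity(constraints):
    """QUIST style: a derived constraint adds no further restriction beyond
    the given constraints it came from, so only given constraints count."""
    return prod(sel for _, sel, is_derived in constraints if not is_derived)


if __name__ == "__main__":
    # Hypothetical selectivities, chosen only to show the effect.
    constraints = [
        ("Deadweight > 200", 0.10, False),        # given in the query
        ("Shiptype = supertanker", 0.20, True),   # derived from the first
    ]
    print(independent_selectivity(constraints))
    print(given_only_selectivity(constraints))
```

Multiplying in the derived constraint would understate the number of qualifying tuples, because every tuple that satisfies the given constraint already satisfies the derived one.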

We have now concluded the description of how QUIST operates. In the next chapter, we discuss
the system’s effectiveness.


Chapter 5

The  effectiveness  of  the  QUIST  system 


QUIST has been implemented to investigate the design of effective systems for semantic query
optimization. We described issues in the design of such effective systems in Section 4.1. In this
chapter, we address the question of QUIST’s effectiveness from several different perspectives.

In Section 5.1, we present details of the cost model used in QUIST’s cost estimation step (the last
step). We use this cost model to provide quantitative estimates of the reduction in the cost of
processing that is obtained by effecting each of the transformations defined in Section 4.4.1.3: index
introduction, join introduction, scan reduction, and join elimination.

We then examine timing results for a range of queries in Section 5.2. Processing of the selected
queries illustrates each of the indicated transformations, as well as the ability of QUIST to decide that
no inference is apt to be fruitful or to recognize when the original query restrictions cannot be
satisfied.

Finally, we take up the question in Section 5.3 of the continued effectiveness of QUIST’s control
strategy as the size of the database or the number of semantic rules increases.

5.1  Quantitative  estimates  of  query  improvements 

This section presents estimates of the quantitative improvement that can be obtained for each of
the four QUIST transformations: index introduction, join introduction, join elimination, and scan
reduction. It is possible that the application of one of these transformations results in a complete
change in the sequence of processing the complete query. Hence, it is not possible to state directly
what the overall effect on query cost will be of any given transformation. Instead, estimates are
presented for local changes, as if the transformation had no other effect. Additional changes at the
resequencing level would lower costs even further.

5.1.1 Processing assumptions and cost formulas

First, we briefly review our assumptions concerning how scans and joins are performed. Each
relation  is  assumed  to  reside  on  a  single  file  that  is  divided  into  pages  of  P  records  each,  where  P  is  a 
systemwide  constant.  Furthermore,  it  is  assumed  that  each  file  entirely  fills  the  storage  segment  in 
which  it  resides.  The  effect  of  this  assumption  is  that  the  cost  to  perform  a  sequential  segment  scan  is 
simply  the  number  of  pages  in  the  file,  because  we  read  no  pages  associated  with  other  files.  This 
assumption  leads  to  underestimates  of  the  improvements  brought  about  by  transformations  that 
eliminate  segment  scans. 

A join is performed on two relations; throughout this section, we assume that R1 is the outer
relation and R2 is the inner relation, unless stated otherwise. R1 is scanned by means of a sequential
scan or an indexed scan. We only consider clustering indexes in this section. For every qualifying
tuple in R1, we find the corresponding tuples in R2. This is achieved either by a sequential scan of
R2, or by a clustering indexed scan if we have a constraint on an indexed attribute of R2 other than
the join attribute, or by what can be called a link scan, a clustering indexed scan in which the
clustering index is on the join attribute. The cost to find the corresponding records varies according
to the method of the inner scan.

Based on the processing assumptions, we develop necessary cost formulas. Let Ni be the number
of records in the file that corresponds to relation Ri. Therefore, the file occupies Ni/P pages, and the
cost S(Ri) to perform the sequential segment scan is given by

\[ S(R_i) = N_i/P. \]

The cost of a join depends on the number of qualifying items in the outer relation. This in turn
depends upon the selectivity of the restrictions on that relation. Let αi be the selectivity value of the
restrictions on relation Ri, where 0 ≤ αi ≤ 1. Then the number of qualifying tuples from relation Ri is

\[ \alpha_i N_i. \]

Unless otherwise stated, we assume that the outer relation of a join is scanned by means of a
sequential segment scan. Given that assumption, there are three cost formulas for the join between
relation R1 and R2. The appropriate formula for the cost J(R1,R2) depends upon how tuples of R2
are found to match the current qualifying tuple of R1. If R2 is scanned by means of a sequential scan,
the cost is

\[ J(R_1,R_2) = N_1/P + \alpha_1 N_1 N_2/P. \]

If R2 is scanned by means of an indexed scan on a clustering index of an attribute other than the join
attribute, then the join cost is given by

\[ J(R_1,R_2) = N_1/P + \alpha_1 N_1 \alpha_2 N_2/P. \]

The difference between these two costs is due solely to the fact that only α2N2/P pages need to be
scanned for each of the α1N1 scans of R2 for an indexed scan, versus N2/P pages each time for a
sequential scan. The figure for an indexed scan relies on the assumption that the values permitted by
the restriction on the indexed attribute are clustered together rather than scattered throughout the
relation. This is the case, for instance, when a numerical attribute is confined to some interval.

If there is a clustering index on the join attribute of R2, we assume that the query processor or the
underlying file system is sophisticated enough to maintain its “current place” in the scan of R2 within
an in-core buffer, so any page of R2 is read at most once. In other words, rather than scanning R2
completely for every qualifier in R1, the system first checks to see if the proper page is in the buffer.
The next page of R2 is brought in from the disk only if it is not in the buffer or if the end of the
previous page is reached while reading matching tuples. In this case, then, the cost to perform the
join is merely the cost to scan R1 plus the number of pages of R2 that hold tuples that correspond to
R1 qualifiers.

To determine the fraction of pages with corresponding tuples, we make the additional assumption
that there is a 1 to N relationship between records of file R1 and file R2. This seems a reasonable
assumption for the kind of hierarchical link that is implemented by the index just described. Under
this assumption, every tuple in R1 has, on the average, N2/N1 corresponding tuples in R2. We refer
to the inverse of this as β, so that 0 < β ≤ 1. The fraction of pages of R2 which are brought in is
approximately the same fraction of pages on which there are qualifiers in R1. If we assume that those
qualifiers are bunched and not randomly scattered throughout R1, then the fraction of R2 pages
brought in is simply α1. Therefore, we have the following formula for the join with an indexed scan
on the join attribute:

\[ J(R_1,R_2) = N_1/P + \alpha_1 N_2/P. \]

The considerable saving of this method (the elimination of the N1 factor in the second term) is due to
the clustering of the two files with respect to each other.
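The scan cost and the three join cost formulas of this section translate directly into code. The sketch below is illustrative: the function names are hypothetical, costs are in page fetches, and the file sizes and selectivities in the usage example are made-up values chosen to show the relative magnitudes.

```python
def scan_cost(n, p):
    """Sequential segment scan: S(R) = N / P page fetches."""
    return n / p


def join_cost_sequential(n1, n2, alpha1, p):
    """Inner relation scanned sequentially for every outer qualifier."""
    return n1 / p + alpha1 * n1 * n2 / p


def join_cost_indexed(n1, n2, alpha1, alpha2, p):
    """Inner relation scanned through a clustering index on a restricted
    attribute other than the join attribute."""
    return n1 / p + alpha1 * n1 * alpha2 * n2 / p


def join_cost_link(n1, n2, alpha1, p):
    """Inner relation reached by a link scan (clustering index on the
    join attribute); each page of the inner relation is read at most once."""
    return n1 / p + alpha1 * n2 / p


if __name__ == "__main__":
    # Hypothetical parameters: 200 outer records, 20,000 inner records,
    # 20 records per page, selectivities 0.1 (outer) and 0.05 (inner).
    n1, n2, p, a1, a2 = 200, 20_000, 20, 0.1, 0.05
    print("sequential:", join_cost_sequential(n1, n2, a1, p))
    print("indexed:   ", join_cost_indexed(n1, n2, a1, a2, p))
    print("link:      ", join_cost_link(n1, n2, a1, p))
```

With these assumed parameters the three methods cost roughly 20,010, 1,010, and 110 page fetches respectively, which shows how strongly the cross product term dominates when the inner scan is sequential.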


5.1.2 Cost improvements from transformations

The join cost formulas are used to show how different QUIST query transformations reduce the
cost of processing in selected examples.

5.1.2.1  Index  introduction 

In index introduction, a constraint is obtained on an index that was not previously constrained.
Assume in this example that the index is not on the join attribute. R1 and R2 have constraints with
selectivities α1 and α2, as usual. Suppose a new constraint is inferred on a clustering index of R1, and
assume it depends at least partly on other constraints on R1 so that the overall selectivity is still α1. If
we keep R1 as the outer relation, then the cost of the join is

\[ J(R_1,R_2) = \alpha_1 N_1/P + \alpha_1 N_1 N_2/P. \]

The original cost is

\[ J(R_1,R_2) = N_1/P + \alpha_1 N_1 N_2/P. \]

However, for large files (large values of N1 and N2) we expect the cross product terms (the ones that
involve N1N2) to dominate the costs, so there is only a marginal improvement. If, on the other hand,
a constraint is inferred on a clustering index of R2, then the new cost is

\[ J(R_1,R_2) = N_1/P + \alpha_1 N_1 \alpha_2 N_2/P \]

assuming that the new constraint does not change the overall selectivity of constraints on R1.
Considering just the cross product terms, if C is the original cost and C' is the cost after the
transformation, then the two costs are related by

\[ C' = \alpha_2 C \]

where α2 is a fraction between 0 and 1, hence there could be a substantial reduction in cost.
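As a worked check of the two cases, the following lines spell out the algebra under the Section 5.1.1 cost formulas. The only added assumption is the stated condition on the size of the inner file; nothing else is introduced beyond the symbols already defined.

```latex
% Index on the outer relation R_1: only the first (scan) term changes.
% Index on the inner relation R_2: the cross product term is scaled.
\begin{align*}
\text{Index on } R_1:\quad
  C - C' &= \frac{N_1}{P} - \frac{\alpha_1 N_1}{P}
          = (1-\alpha_1)\,\frac{N_1}{P}, \\[4pt]
\text{Index on } R_2:\quad
  \frac{C'}{C} &= \frac{N_1/P + \alpha_1\alpha_2 N_1 N_2/P}
                       {N_1/P + \alpha_1 N_1 N_2/P}
                = \frac{1 + \alpha_1\alpha_2 N_2}{1 + \alpha_1 N_2}
                \;\approx\; \alpha_2
                \qquad (\alpha_1\alpha_2 N_2 \gg 1).
\end{align*}
```

In the first case the saving is at most one scan of R1, which is small next to a cross product term of size α1N1N2/P. In the second case the whole dominant term shrinks by the factor α2.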

5.1.2.2 Join introduction

Suppose R2 is part of the original query and R1 is not, but there is a clustering link from R1 to R2;
that is, R2 has a clustering index on an attribute to which it is joined to R1, and R1 and R2 are in
proper sequence with respect to their respective joining attributes. If we infer a constraint on R1 (and
if suitable structural constraints are met - see Appendix B) then a join between R1 and R2 can be
introduced.

Consider a case where the original query includes a join between R2 and some relation R3.
Assuming that neither relation is constrained on an indexed attribute, the cost of performing the join
is

\[ J(R_2,R_3) = N_2/P + \alpha_2 N_2 N_3/P \]

if R2 is the outer relation. The first term is the scan of R2 and the second term is the cross matching
term of R2 and R3.

Now suppose R1 enters through join introduction. There are now two joins to be done. The cost
of the join between R1 and R2 is given by

\[ J(R_1,R_2) = N_1/P + \alpha_1 N_2/P \]

if R1 is the outer relation. The two joins can be cascaded so that the cost to join R2 and R3 only
includes the previous cross matching term. Therefore, the total cost is now

\[ C' = N_1/P + \alpha_1 N_2/P + \alpha_2 N_2 N_3/P. \]

The factor α2 in the final cross matching term is assumed to be the same after join introduction as
before; that is, it is assumed there will be as many qualifiers from R2, hence as many scans of R3.
This is based on the assumption that the constraints on R1 are inferred from constraints already on
R2. If we denote the original cost formula as C = J(R2,R3) and if we recall the ratio between the file
sizes as β = N1/N2, then we find that the original cost C is related to the new cost C' by

\[ C' = C - [1 - (\alpha_1 + \beta)]\,N_2/P. \]

If R1 is the same size as R2, so that β = 1, then join introduction clearly is not worthwhile. But if
both α1 and β are near zero, then join introduction can be quite useful.
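The relation between C and C' follows in two steps from the formulas given for the original and the transformed query. The derivation below uses only those formulas and the definition β = N1/N2. The break-even remark at the end is a direct consequence rather than a statement taken from the text.

```latex
\begin{align*}
C  &= \frac{N_2}{P} + \frac{\alpha_2 N_2 N_3}{P}, \\
C' &= \frac{N_1}{P} + \frac{\alpha_1 N_2}{P} + \frac{\alpha_2 N_2 N_3}{P}, \\
C' - C &= \frac{N_1 + \alpha_1 N_2 - N_2}{P}
        = \frac{(\beta + \alpha_1 - 1)\,N_2}{P}
        = -\bigl[\,1 - (\alpha_1 + \beta)\,\bigr]\frac{N_2}{P},
        \qquad \beta = \frac{N_1}{N_2}.
\end{align*}
```

Join introduction therefore lowers the estimated cost exactly when α1 + β < 1. With β = 1 the difference is α1N2/P ≥ 0, which is the case the text describes as not worthwhile.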

5.1.2.3 Scan reduction

In this transformation, constraints are inferred “across the join boundary”. The point very simply
is to reduce the number of qualifiers in the outer scan. If a constraint of selectivity α' is inferred on
R1, and if it is independent of the other constraints already on R1, then by considering the dominant
product term as we did for index introduction we find that the new cost C' is related to the old cost C
by the same relationship:

\[ C' = \alpha' C. \]

5.1.2.4  Join  elimination 

A  relation  is  only  involved  in  the  query  to  constrain  another  relation;  none  of  its  attributes  are 
desired  as  output  and  it  is  only  joined  to  that  one  other  relation.  In  this  transformation,  the 
constraints  on  the  “dangling”  relation  are  shown  to  be  derivable  from  other  constraints  in  the  query, 
so  the  join  to  that  relation  is  simply  eliminated. 

Generally  speaking,  this  should  lead  to  a  reduction  in  the  cost  of  the  query  by  about  the  amount 
needed  to  perform  the  join.  Therefore,  because  a  join  is  often  very  expensive,  join  elimination  may 
bring  about  a  substantial  cost  reduction.  However,  in  the  case  of  a  clustering  link  such  as  supports 
the  join  introduction  transformation,  the  elimination  of  a  join  may  actually  increase  the  cost  of  the 
query,  so  join  elimination  would  not  then  be  desirable. 

5.2  Experiments  with  the  QUIST  system 

To demonstrate the effect of semantic query optimization on the cost of processing queries, a
selection of QUIST queries, including the example query of Chapter 4, has undergone semantic
query optimization with the QUIST system. The queries are specifically chosen to illustrate various
transformations that can be obtained by means of semantic query optimization, and to illustrate the
magnitude of the resulting reductions in query processing cost. The query processing cost estimates
are based on the model of query processing described in Section 5.1 and depend also upon the
assumed size of the files indicated below. The stated time to perform the analysis itself comes from
actual measurements, but depends upon the implementation of QUIST.

What these results suggest about the potential importance of semantic query optimization is more
significant than the specific numbers reported here. Beyond the particular processing estimates, the
results support the contention that semantic query optimization can bring about significant
reductions in the cost of processing queries with an acceptable overhead for analysis. They also
suggest the effectiveness of the inference-guiding heuristics.

We  assume  the  same  database  as  the  one  described  in  Section  4.4.  Physical  database  parameters 
have  been  chosen  in  order  to  give  some  idea  of  processing  time  for  a  moderately  large  database.  We 
assume  that  each  relation  in  the  example  database  corresponds  to  a  single  physical  file  with  the 
following  number  of  records: 


    File        Size (records)

    SHIPS           20,000
    CARGOES         25,000
    PORTS            2,000
    OWNERS             200
    POLICIES        25,000
    INSURERS           500

Table 5-1: Assumed File Sizes for Timing Experiments

We assume that there are twenty records per file page (the same value used in [Yao79]) and that
the time per page fetch is thirty milliseconds (extrapolated from [Gotlieb75]). We further assume
that the SHIPS file is physically clustered with respect to the OWNERS file.
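These parameters turn a page-fetch count into an estimated processing time. The sketch below shows the conversion for a full sequential scan of each file. The constant and function names are hypothetical, and the INSURERS size of 500 records is an assumed reading of the file-size table.

```python
RECORDS_PER_PAGE = 20            # records per file page
SECONDS_PER_PAGE_FETCH = 0.030   # thirty milliseconds per page fetch

# Assumed file sizes in records (Table 5-1).
FILE_SIZES = {
    "SHIPS": 20_000,
    "CARGOES": 25_000,
    "PORTS": 2_000,
    "OWNERS": 200,
    "POLICIES": 25_000,
    "INSURERS": 500,
}


def pages(file_name):
    """Number of pages occupied by a file."""
    return FILE_SIZES[file_name] / RECORDS_PER_PAGE


def scan_seconds(file_name):
    """Estimated time for one sequential scan of the whole file."""
    return pages(file_name) * SECONDS_PER_PAGE_FETCH


if __name__ == "__main__":
    for name in FILE_SIZES:
        print(f"{name:9s} {pages(name):7.0f} pages {scan_seconds(name):6.2f} s")
```

Under these assumptions SHIPS occupies 1,000 pages, so one full scan of it takes about 30 seconds, while a scan of OWNERS takes about 0.3 seconds.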

The example rule base contains approximately thirty rules like the ones in Section 4.4. These rules
were obtained in part from Couper’s “Geography of Sea Transport” [Couper72].

The QUIST system is implemented in Interlisp [Teitelman78] on a Digital Equipment Corporation
DECSystem20 model 60 computer.


5.2.1  Analysis  of  individual  queries 


We  start  with  the  example  query  of  Chapter  4: 


“List  the  destination  of  cargoes  worth  less  than  one  million  dollars  being  carried  by 
supertankers  over  400  thousand  tons  deadweight  to  ports  with  offshore  load/discharge 
facilities. ” 


As indicated in Section 4.4.3, three semantically equivalent queries can be generated in addition to
the original one. These include two useful transformations of the original query: join introduction
and join elimination. First, it is worthwhile to add the OWNERS file into the query because SHIPS
is much larger than OWNERS and is physically clustered with respect to it. Second, it is possible to
show that the constraint on PORTS is superfluous. Because PORTS is only joined to one file and is
not involved in the output, it can be eliminated. Hence, the lowest cost transformed query turns out
to be:

“List the destination of cargoes worth less than one million dollars being carried by
supertankers over 400 thousand tons deadweight owned by leasing companies.”

QUIST’s conventional query optimization subsystem determines that for the assumed file sizes and
auxiliary structures, the original query in SHIPS, CARGOES, and PORTS can be processed in 435
seconds. The query in OWNERS, SHIPS, and CARGOES that results from join introduction and
join elimination can be processed in just 19 seconds. The total time to perform this analysis is 2.1
seconds.

The reduction in cost of well over an order of magnitude for this example query comes from the
simultaneous occurrence of fortunate circumstances in the query, database structure, and semantic
rules. Indeed, the query was specifically chosen to show what can happen when circumstances are
right.

Things are not always so well-suited for semantic query optimization, but there are many situations
in which significant improvements can be obtained. We now present a set of queries to illustrate the
specific transformations of QUIST: index introduction, join elimination, scan reduction, and join
introduction. Other queries in the set illustrate two other important characteristics of QUIST. First,
QUIST can detect a query whose qualification cannot be satisfied because of semantic integrity
constraints (and which is therefore a null query). Second, QUIST can determine rapidly when there
are no opportunities for cost reduction via semantically based transformations.

For this set of queries, it is assumed that the database has clustering indexes on the Shiptype field
of SHIPS, the Ship field of CARGOES, the Country field of PORTS, and the Issuer field of
POLICIES. The queries are presented along with the rule that is relevant to the particular
transformation, the transformed version of the query, if any, and the resulting change in processing, if
any.

1.  Index  introduction. 

Query Q1: “List the owners of all ships with a deadweight greater than 200 thousand
tons.”

Relevant rule: “Any ship over 150 thousand tons deadweight is a supertanker.” (This is a
change from example rule R6.)

Transformed query: “List the owners of all supertankers with a deadweight greater than
200 thousand tons.”

Result: A new constraint on the indexed Shiptype attribute of SHIPS.

2.  Join  elimination. 

Query  Q2:  “List  the  shipper  and  quantity  of  liquefied  natural  gas  cargoes  carried  by 
pressurized  tankers  to  Marseilles.” 

Relevant  rule:  “Liquefied  natural  gas  is  always  carried  by  pressurized  tankers.” 

Transformed  query:  “List  the  shipper  and  quantity  of  liquefied  natural  gas  cargoes 
carried  to  Marseilles." 

Result:  Elimination  of  the  join  with  SHIPS  because  Shiptype  constraint  is  superfluous. 

3.  Scan  reduction. 

Query  Q3:  “List  the  owners  and  the  quantity  of  cargo  of  ships  carrying  refined 
petroleum  products  to  Danish  ports." 

Relevant  rule:  "Refined  petroleum  products  are  carried  by  ships  with  deadweight  under 
60  thousand  tons.” 

Transformed  query:  “List  the  owners  and  the  quantity  of  cargo  of  ships  with  deadweight 
under  60  thousand  tons  carrying  refined  petroleum  products  to  Danish  ports.” 

Result: Constraint on Deadweight can be applied prior to cross matching step of join
between SHIPS and CARGOES, reducing the number of qualifying SHIPS tuples and
therefore the number of scans of CARGOES.

4.  Join  introduction. 

Query Q4: “List the owners of supertankers with deadweight over 350 thousand tons that
are carrying cargoes to French ports.”

Relevant rule: “Only leasing companies own vessels that exceed 300 thousand tons
deadweight.”

Transformed query: “List the leasing company owners of supertankers with deadweight
over 350 thousand tons that are carrying cargoes to French ports.”

Result: Addition of join between OWNERS and SHIPS has the effect of a more efficient
scan of SHIPS.

5.  Detection  of  unsatisfiable  conditions. 

Query  Q5:  “List  the  owners  of  all  bulk  cargo  ships  with  deadweight  over  200  thousand 
tons.” 

Relevant rule: “Any ship over 150 thousand tons deadweight is a supertanker.”

Result: No transformed query. QUIST indicates that no items can satisfy the query
conditions.

6.  Absence  of  opportunities  for  cost  reducing  semantic  transformation. 

Query Q6: “List the owners of all refrigerated ships.”

Result: Indexed attribute Shiptype is already constrained. No relevant rule. Query
remains the same.
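Queries Q1, Q5, and Q6 all turn on the same rule, so their three outcomes can be sketched together. The code below is a deliberately narrow illustration: it hard-codes the single rule “any ship over 150 thousand tons deadweight is a supertanker”, represents a query as a dictionary with the hypothetical keys `deadweight_over` and `shiptype`, and makes no attempt to model QUIST's general inference mechanism.

```python
RULE_THRESHOLD = 150            # thousand tons deadweight
RULE_SHIPTYPE = "supertanker"


def apply_supertanker_rule(query):
    """Return the transformed query, or None if the query is a null query.

    query["deadweight_over"]: lower bound on deadweight, if constrained.
    query["shiptype"]:        required ship type, if constrained.
    """
    bound = query.get("deadweight_over")
    shiptype = query.get("shiptype")
    # The rule fires only if the query guarantees deadweight > threshold.
    if bound is None or bound < RULE_THRESHOLD:
        return dict(query)                       # no relevant rule (Q6)
    if shiptype is None:
        # New constraint on the indexed Shiptype attribute (Q1).
        return {**query, "shiptype": RULE_SHIPTYPE}
    if shiptype != RULE_SHIPTYPE:
        return None                              # contradiction (Q5)
    return dict(query)


if __name__ == "__main__":
    print(apply_supertanker_rule({"deadweight_over": 200}))
    print(apply_supertanker_rule({"deadweight_over": 200, "shiptype": "bulk cargo"}))
    print(apply_supertanker_rule({"shiptype": "refrigerated"}))
```

The first call gains a Shiptype constraint, the second is recognized as unsatisfiable without touching the database, and the third is returned unchanged.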

We now show the estimated processing times associated with these queries on the example
database (Table 5-2).

    Query   Transformation            Est. Proc. Time       Est. Proc. Time
                                      without SQO (sec.)    with SQO (sec.)

    Q0      join introduction                435                   19
            & join elimination
    Q1      index introduction                30                    4
    Q2      join elimination                 313                   37
    Q3      scan reduction                  1125                  519
    Q4      join introduction                348                  112
    Q5      (unsatisfiable)†                  30                    0
    Q6      (SQO deemed useless)               3                    3

    † The query is a null query. QUIST would not send it to the database for processing.

Table 5-2: Reduction in Processing Costs with SQO

Two times are shown. The first one is the estimated time to process the original query optimized only
by conventional means. The second one is the estimated time to process the transformed query, that
is, with both semantic and conventional optimization. Again, what is significant is the relative
magnitude of processing times rather than the precise times indicated. Note that the amount of
processing time for QUIST itself is about 1 second in each case. In the case of the detection of
unsatisfiable conditions, the time for QUIST analysis is under half a second. These QUIST analysis
times include all four steps described in Chapter 4, including the time it takes to carry out
conventional query optimization for each alternative query. The example query of Chapter 4 is
referred to as Q0.
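Since the relative magnitudes matter more than the absolute times, it can help to compute the ratios directly from Table 5-2. The sketch below does so, and also nets out the analysis overhead for Q0 using the 2.1 seconds reported for that query. The function and variable names are hypothetical, and treating a zero processing time as an unbounded speedup is a convention chosen here for the null query Q5.

```python
# (seconds without SQO, seconds with SQO), taken from Table 5-2.
TABLE_5_2 = {
    "Q0": (435, 19),
    "Q1": (30, 4),
    "Q2": (313, 37),
    "Q3": (1125, 519),
    "Q4": (348, 112),
    "Q5": (30, 0),    # null query: never sent to the database
    "Q6": (3, 3),
}


def speedup(query):
    """Ratio of estimated processing time without SQO to time with SQO."""
    without_sqo, with_sqo = TABLE_5_2[query]
    return float("inf") if with_sqo == 0 else without_sqo / with_sqo


def net_saving(query, analysis_seconds):
    """Seconds saved after charging the time spent on semantic analysis."""
    without_sqo, with_sqo = TABLE_5_2[query]
    return without_sqo - (with_sqo + analysis_seconds)


if __name__ == "__main__":
    for q in TABLE_5_2:
        print(q, round(speedup(q), 1))
    print("Q0 net saving:", round(net_saving("Q0", 2.1), 1), "seconds")
```

On these figures the speedups range from none (Q6) through roughly a factor of two (Q3) to more than a factor of twenty (Q0), and the analysis overhead is small next to the saving whenever a transformation applies.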

5.2.2  The  effect  of  inference-guiding  heuristics 

Another important factor in QUIST’s effectiveness is the effect of the inference-guiding heuristics.
Table 5-3 indicates that QUIST spends much more time generating constraints when the heuristics
are not enforced. This effect is compounded because when more constraints are generated, more
alternative queries may be formulated, and these additional queries must each undergo conventional
optimization.

              QUIST - all steps              QUIST - inference only
    Query     time with     time without     time with     time without
              pruning       pruning          pruning       pruning

    Q1        .43 sec.      67               .29           1.20
    Q2        .90           2.6              .42           .58
    Q3        1.00          1.1              .45           .51
    Q4        1.30          20.1             .22           .98

Table 5-3: Effect of Inference-Guiding Heuristics

For each of the four queries noted here, four timing figures are given. First there are two timings for
all steps of QUIST, with and without pruning based upon heuristics; that is, with and without the use
of constraint targets. Second, there are two timings for just the inference portion of QUIST (step 2
described in Chapter 4) with and without pruning. A larger difference is seen when we look at all
steps of QUIST rather than just at QUIST’s inference steps. As noted above, this reflects the effort to
estimate the cost of more alternative queries.

The effectiveness of inference-guiding heuristics is suggested by the number of rules tested in the
analysis of a query like query Q1 with and without pruning. Without pruning, 33 rules were tried;
these included 20 separate rules plus repetitions due to renewed eligibility as new constraints were
inferred. Of these, 11 were actually found applicable and were used to infer new constraints. When
constraint targets were established and used, however, only 8 rules were found eligible, and only 1
was used to infer a new constraint. Analysis without pruning was even more inefficient because cost
estimates had to be found for additional alternative queries.

5.3 The stability of QUIST’s control strategy

One measure of the effectiveness of a strategy to control inference is how well it constrains the
search. In QUIST’s case, the pertinent question is how many rules are tested during the analysis of a
typical query. QUIST’s control strategy has been effective for a relatively small set of approximately
thirty-five rules. Would the strategy still be effective as the size of the database or the number of
rules increases?

Our answer to the question is a qualified “yes”. We shall argue that the number of rules that must
be tested for a typical query is bounded, and that the bound hardly increases at all as the database or
set of rules grows. We also offer some plausible arguments that the bound is small enough that the
rules can be tested efficiently.

We  make  the  following  assumption  about  the  rules  in  order  to  simplify  the  argument: 

A1. Simple rules. Each rule is a production relating two attributes. That is, each rule is of
the form C1(A1) → C2(A2) for two attributes A1 and A2.

In Section 4.4.2.1, we described how QUIST’s rules are used during the analysis of a query. In this
section, we follow that description in order to establish a suitable measure for the effort expended by
QUIST on a typical problem.

First, we establish some terminology. A rule R is associated with an attribute A if and only if A is
one of the two attributes constrained in the rule (there are just two, according to assumption A1). We
designate by S(A) the set of rules associated with attribute A; this set need not be nonempty for every
attribute.

Let us illustrate what can happen to a rule by means of an example. Suppose that attributes A1
and A2 are on constraint target relations, and that attribute A3 is on a nontarget relation. Also
suppose that the knowledge base contains the following rules:

R1: (A1 > 40) → (A2 > 200)

R2: (A1 > 30) → (A2 > 50)

R3: (A1 > 60) → (A2 > 300)

R4: (A1 < 20) → (A2 > 150)

R5: (A1 > 25) → (A3 > 100)

Assume that at some point in processing, it is known that (A2 > 100) but that no constraint is
known on A1 or A3.

Now suppose it is concluded (by rules other than R1 through R5) that (A1 > 50). All rules
associated with A1 (that is, those in S(A1)) fall into one of five classifications illustrated by rules R1
through R5.

First, we exclude from possible use those rules that are not promising in the sense that they
conclude a constraint about an attribute on a nontarget relation. Rule R5 is not promising because,
given a constraint on attribute A1, it would conclude a constraint on attribute A3, which is on a
nontarget relation. By prior organization of the rules, it should be possible to determine unpromising
rules once and for all at negligible cost.

Next, we see which rules are applicable in the sense that the rule’s constraint on A2 is implied by
the newly inferred restriction on A1. Rules R1 and R2 are applicable, but rules R3 and R4 are not.
Furthermore, rule R4 can never be applicable, so it can be excluded from consideration at any
subsequent point. By contrast, it is possible that rule R3 may become applicable if the constraint on
A1 is later strengthened as a result of additional inferences. We assume one “unit” of cost to check
promising rules for applicability because of the work involved in comparing the rule constraints and
the current restriction on the attribute. The test for potential future applicability falls out of the
direct test for applicability, so it costs no more.

Finally, we determine if the applicable rules are effective in the sense that they produce a new
constraint. Rule R2 is not effective because the constraint it yields, (A2 > 50), is weaker than the
current restriction on A2. Rule R1 is effective because it yields a new and stronger constraint, (A2 >
200). In either case, another “unit” of cost is incurred comparing constraints on attribute A2. In
addition, both rules are now “used up” and excluded from further testing.
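
The three tests just described can be sketched in code. The following Python fragment is only an illustration under stated assumptions, not QUIST's implementation: the names Bound, Rule, and classify are hypothetical, every constraint is taken to be a one-sided numeric bound, and TARGETS lists the attributes that lie on constraint target relations.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass(frozen=True)
class Bound:
    """One-sided constraint on an attribute: attr > value or attr < value."""
    attr: str
    op: str        # ">" or "<"
    value: float

    def implies(self, other: "Bound") -> bool:
        """True if every value satisfying self also satisfies other."""
        if self.attr != other.attr or self.op != other.op:
            return False
        if self.op == ">":
            return self.value >= other.value
        return self.value <= other.value

    def compatible(self, other: "Bound") -> bool:
        """True if some value satisfies both bounds (same attribute assumed)."""
        if self.op == other.op:
            return True
        lower, upper = (self, other) if self.op == ">" else (other, self)
        return lower.value < upper.value


@dataclass(frozen=True)
class Rule:
    name: str
    antecedent: Bound
    consequent: Bound


def classify(rule: Rule, new: Bound, known: Dict[str, Bound], targets: Set[str]) -> str:
    """Classify a rule associated with the attribute restricted by `new`."""
    # Promising test: the conclusion must concern an attribute on a target relation.
    if rule.consequent.attr not in targets:
        return "unpromising"
    # Applicability test: the new restriction must imply the rule's antecedent.
    if not new.implies(rule.antecedent):
        if new.compatible(rule.antecedent):
            return "inapplicable, retained"
        return "inapplicable, never applicable"
    # Effectiveness test: the conclusion must be stronger than what is known.
    current: Optional[Bound] = known.get(rule.consequent.attr)
    if current is not None and current.implies(rule.consequent):
        return "applicable, not effective"
    return "applicable, effective"


# Rules R1 through R5 from the example in the text.
RULES = [
    Rule("R1", Bound("A1", ">", 40), Bound("A2", ">", 200)),
    Rule("R2", Bound("A1", ">", 30), Bound("A2", ">", 50)),
    Rule("R3", Bound("A1", ">", 60), Bound("A2", ">", 300)),
    Rule("R4", Bound("A1", "<", 20), Bound("A2", ">", 150)),
    Rule("R5", Bound("A1", ">", 25), Bound("A3", ">", 100)),
]
KNOWN = {"A2": Bound("A2", ">", 100)}   # restriction known before the new inference
TARGETS = {"A1", "A2"}                  # A3 is on a nontarget relation
NEW = Bound("A1", ">", 50)              # the newly concluded restriction
CLASSES = {rule.name: classify(rule, NEW, KNOWN, TARGETS) for rule in RULES}
```

Applied to R1 through R5 with the new restriction (A1 > 50) and the known restriction (A2 > 100), the sketch assigns each rule to the classification given for it in the text.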

To  generalize  from  this  example,  we  assume: 

•  Unpromising rules incur no cost and are excluded from further testing.

•  Inapplicable rules incur one unit of cost; only rules that are potentially applicable later
are retained for further testing.

•  Applicable rules, whether effective or not, incur two units of cost and are excluded from
further testing.

Hence, the cost of analyzing a query can be determined as follows. For every attribute that is
constrained once, either in the original query or by means of subsequent inference, the cost equals
the number of inapplicable associated rules, plus twice the number of applicable rules. For every
attribute that is constrained twice, the cost is the previous cost plus an additional cost figured only on
the basis of potentially applicable rules left over after the first constraint was asserted. Costs for
subsequent constraints are figured the same way, on the basis of a dwindling set of potentially
applicable rules.
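
As a rough illustration of this cost accounting, the following Python sketch totals the units spent on a single attribute. It rests on simplifying assumptions that are not part of the text: the function name attribute_cost is hypothetical, every restriction asserted on the attribute is a lower bound of the form A > b, each successive bound is at least as strong as the last, and unpromising rules have already been removed.

```python
from typing import Iterable, List, Tuple

# (op, threshold) stands for a rule whose antecedent is "A op threshold".
Antecedent = Tuple[str, float]


def attribute_cost(antecedents: Iterable[Antecedent],
                   lower_bounds: Iterable[float]) -> int:
    """Units of cost for one attribute restricted to A > b for each b in turn."""
    remaining: List[Antecedent] = list(antecedents)
    cost = 0
    for bound in lower_bounds:
        retained: List[Antecedent] = []
        for op, threshold in remaining:
            if op == ">" and bound >= threshold:
                # Applicable: applicability test plus effectiveness test, then used up.
                cost += 2
            else:
                # Inapplicable: applicability test only.
                cost += 1
                if op == ">":
                    # Still potentially applicable if A is restricted further.
                    retained.append((op, threshold))
                # A "<" antecedent can never follow from lower bounds, so it is dropped.
        remaining = retained
    return cost


# Antecedents of the promising rules R1 to R4 in the example; R5 is unpromising.
PROMISING = [(">", 40), (">", 30), (">", 60), ("<", 20)]
COST_ONCE = attribute_cost(PROMISING, [50])        # A1 restricted once
COST_TWICE = attribute_cost(PROMISING, [50, 70])   # A1 restricted a second time
```

For the example above, restricting A1 to (A1 > 50) costs 6 units under these assumptions: two each for R1 and R2, one each for R3 and R4. If A1 were later restricted to (A1 > 70), only R3 would be retested, for a total of 8 units.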

Therefore,  the  problem  of  determining  the  cost  of  analyzing  a  typical  query  becomes  a  problem  of 
determining  the  following  quantities: 

1. the number of attributes that are constrained

2. the number of times each attribute is constrained

3. the number of associated rules that are promising

4. the number of promising rules that are applicable

5. the number of inapplicable rules that remain for subsequent testing.

As  we  stated  above,  we  will  not  attempt  to  make  an  actual  estimate  of  the  cost  of  analyzing  a 
typical  query,  but  will  argue  that  the  cost  is  bounded,  that  the  bound  is  stable  with  respect  to  the  size 
of  the  database  and  the  rule  base,  and  that  the  bound  is  likely  to  be  “reasonable”.  We  make  several 
more  plausible  assumptions: 


A2. Simple queries. Almost all queries constrain just a few attributes, no more than (say)
five.†

A3.  Strong  constraints.  Query  constraints  are  often  likely  to  be  quite  restrictive. 

A4.  Nonuniform  distribution  of  rules.  Some  attributes  have  relatively  many  associated  rules, 
many  have  relatively  few  or  none. 

Based  on  these  assumptions,  we  make  one  more  crucial  assumption: 

A5.  Limited  inference.  Only  a  small  number  of  attributes  receive  inferred  constraints,  and 
very  few  of  these  are  constrained  more  than  once. 

Our overall picture of inference in QUIST is as follows. A small number of attributes are
restricted in the query (assumption A2). The process of generating constraint targets therefore does
not yield a large set of targets. Consequently, only a relatively small percentage of the rules
associated with the constrained attributes are promising, are tested, and incur a cost. The query
constraints are probably strong, at least on the “important” attributes that are involved in many rules
(assumptions A3 and A4). Therefore, very few new constraints are inferred and very few rules
remain potentially applicable. Strong constraints probably lead to other strong constraints, so that
there are few if any long chains of inference (assumption A5).

Returning therefore to the five quantities of interest described above, we are asserting that the
number of constrained attributes is likely to be small and that each attribute is likely to be constrained
no  more  than  once.  Concerning  the  quantities  that  involve  numbers  of  rules,  if  the  number  of 
promising  rules  is  reasonable,  then  the  number  of  applicable  and  retested  rules  is  reasonable  too. 

It is this last question of the number of promising rules that involves the growth of the database
and the rule base. We make the following two assumptions about the effect of such growth:


† This seems to be the experience in systems like LADDER [Hendrix78].




A6.  Growth  of  the  database.  The  growth  in  the  number  of  items  in  the  database  has  no  effect 
on  the  number  of  rules. 

This assumption is actually rather obvious, as the rules merely dictate permitted configurations of
the data and are not otherwise linked to any aspect of the data, including the quantity of it.

As for whether the bound is reasonable or not, that depends on just how many rules are likely to
be associated with the number of relations that are constraint targets in a typical query. There is no
solid  evidence  from  prior  research  to  suggest  what  that  number  might  be,  but  contemporary  expert 
systems  in  artificial  intelligence  such  as  MYCIN  [Shortliffe76]  and  PROSPECTOR  [Duda78]  have  on 
the  order  of  a  few  hundred  rules  for  the  entire  system.  This  would  certainly  be  a  manageable 
number. 


Chapter 6

The  significance  of 
semantic  query  optimization 


In this final chapter, we discuss the significance of semantic query optimization in general and of
its formulation in the QUIST system in particular. Our work advances specific ideas about the
processing  of  database  queries  and  about  the  organization  of  planning  programs.  It  also  serves  as  an 
important  example  of  the  fruitful  interaction  between  research  in  artificial  intelligence  and  research 
in  databases.  We  also  discuss  the  limitations  of  the  research  and  make  suggestions  for  future 
investigations. 


6.1 Significance for database research

Semantic  query  optimization  is  significant  for  database  research  in  tying  together  research  on 
query  optimization  with  research  on  the  semantic  integrity  of  databases.  The  synthesis  provides  a 
new  and  powerful  method  of  query  optimization.  We  discuss  this  in  Section  6.1.1.  In  Section  6.1.2, 
we  compare  the  work  on  QUIST  with  a  related,  more  general  proposal,  called  KBQP,  of  Zdonik  and 
Hammer.  We  indicate  that  QUIST  is  a  significant  step  forward  because  it  provides  specific  answers 
about  how  semantic  query  optimization  should  be  carried  out  and  controlled  in  a  context  where 
query  processing  is  relatively  well  understood,  and  because  it  has  shown  specifically  by  how  much 
query  processing  can  be  improved  using  semantic  reasoning. 


6.1.1  The  relationship  of  semantic  integrity  to  query  processing 

The semantic integrity of a database is insured when the data in it are forced to meet semantic
integrity constraints that reflect the real world application modelled by the database. The
development of semantic integrity notions and the design of systems to enforce semantic integrity
were sketched in Section 1.2.4. Through the work of Chang [Chang78], El-Masri [ElMasri80b],
Hammer and McLeod [Hammer75], Roussopoulos [Roussopoulos77], and others, declarative
formalisms have been applied to the purpose of stating general laws that express the semantics of a
database.

The development of the ideas about semantic integrity constraints was motivated by one purpose,
that of making sure that the data in the database is meaningful. Our research advances another
important and unforeseen use of these constraints. We have shown that:

The  semantic  knowledge  about  a  database  expressed  in  semantic  integrity  constraints  can 
sometimes  be  used  to  transform  a  database  query  into  a  semantically  equivalent  query  that 
is  much  less  expensive  to  process  than  the  original  query. 

This  demonstration  thus  brings  together  the  two  apparently  quite  separate  research  areas  of 
database  integrity  and  query  optimization. 

The notion that general rules about the database can be applied during the processing of a query,
and not just during the validation of updates, has appeared in other work, but not for the purpose of
improving efficiency. For instance, in Chang’s DEDUCE2 system [Chang78], general semantic rules
are used to define virtual relations in terms of the basic relations that are stored in the database.
When a DEDUCE2 query is processed, all virtual relations are transformed into the underlying basic
relations. As with QUIST, DEDUCE2 checks whether the query poses conditions that violate
semantic integrity constraints, but DEDUCE2 does not perform transformations for the sake of
efficiency.


6.1.2 The organization and effects of semantic query optimization systems

The insight advanced by QUIST, that semantic integrity constraints can be used for efficiency
transformations, has been introduced independently by Hammer and Zdonik [Hammer80] under the
name knowledge-based query processing (KBQP). Their work resembles the QUIST work in three
essential respects. First, of course, they propose that semantic knowledge about databases be applied
to the problem of efficient query processing. Secondly, they suggest that the way to bring semantic
knowledge to bear on this problem is by means of the transformation of queries into equivalent
queries. Thirdly, they identify control of the query transformation process as crucial to the successful
application of semantic knowledge to query processing.

However, QUIST makes important and original contributions in the introduction of the concept of
semantic query optimization and in the organization and analysis of semantic query optimization
systems. To identify these contributions, it is convenient to contrast QUIST with the KBQP
proposal.

KBQP is intended to operate in the context of an abstract data management system that treats the
database as a collection of sets of objects. The data model resembles the entity-relationship model
[Chen76]. By contrast, QUIST operates with the relational model. The difference is significant
because, as discussed in Chapter 2, research on the relational model has produced a body of
query-processing expertise for which there is little counterpart in studies related to the
entity-relationship model. Indeed, one of the significant demonstrations of our research is that:

Factors that govern the cost of processing a relational database query can be expressed as
expert rules that can help control the query transformation process of a semantic query
optimization system.

It  is  worth  noting  that  this  result  is  the  product  of  taking  an  artificial  intelligence  “perspective” 
toward  results  in  database  research. 

As noted above, the KBQP proposal recognizes that in order to maintain the overall efficiency of
the system, it is necessary to perform only those query transformations that may lead to reduced
query processing costs. Hammer and Zdonik postulate a set of what they term cost-reducing
techniques to control transformations. For example, the aim of one such technique, called domain
refinement, is to convert the domain of a restriction expression into a smaller one whose members are
more readily accessible. The technique of domain refinement seems intuitively plausible. Indeed, it
corresponds in the relational context to QUIST’s index introduction transformation (Section 4.4.1.3).
What the research on QUIST has done is to go beyond intuitive plausibility in the context of a
well-developed model of query processing (Section 5.1), leading to the assertion that:

Several  classes  of  transformations  of  relational  database  queries  that  reduce  the  cost  of 
processing  have  been  identified,  and  the  reduction  they  produce  in  the  cost  of  processing 
has  been  estimated  quantitatively  based  upon  well-developed  models  of  query  processing. 

To control the application of their cost-reducing techniques, Hammer and Zdonik propose a
multiprocessing control structure. At the start of analyzing a query, a separate process is set up for
each technique applied to each subexpression in the original query. Each process is assigned a
priority based upon heuristics that reflect the presumed likelihood that the particular technique will
succeed and produce an improvement in the particular subexpression. An example of such a
heuristic is: “assign a low priority to a process that involves domain refinement applied to an
expression that does not appear in any statements about subset relationships in the knowledge base”.
Hammer and Zdonik acknowledge that the number of processes is apt to grow large. The reason they
propose such an elaborate control structure in spite of this is their belief that it is necessary to reason
about transformation goals at every step in the analysis of the query.
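
To make the shape of such a control structure concrete, here is a loose Python illustration. It is not taken from the KBQP proposal: a priority queue stands in for the separate processes, and the names schedule, example_priority, SUBSET_EXPRESSIONS, and technique_x are all hypothetical. One entry is created for each technique applied to each subexpression, and entries are taken up in priority order.

```python
import heapq
from itertools import product
from typing import Callable, Iterable, Iterator, Tuple


def schedule(techniques: Iterable[str],
             subexpressions: Iterable[str],
             priority: Callable[[str, str], int]) -> Iterator[Tuple[str, str]]:
    """Yield (technique, subexpression) pairs, highest priority first."""
    heap = []
    for order, (tech, expr) in enumerate(product(techniques, subexpressions)):
        # heapq is a min-heap, so the priority is negated; `order` breaks ties.
        heapq.heappush(heap, (-priority(tech, expr), order, tech, expr))
    while heap:
        _, _, tech, expr = heapq.heappop(heap)
        yield tech, expr


# Hypothetical: expressions that appear in subset statements in the knowledge base.
SUBSET_EXPRESSIONS = {"e1"}


def example_priority(tech: str, expr: str) -> int:
    """Low priority for domain refinement on an expression with no subset statements."""
    if tech == "domain_refinement" and expr not in SUBSET_EXPRESSIONS:
        return 1
    return 5


ORDER = list(schedule(["domain_refinement", "technique_x"], ["e1", "e2"],
                      example_priority))
```

Even in this toy setting the number of entries is the product of the number of techniques and the number of subexpressions, which suggests why the number of processes is apt to grow large.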

QUIST controls the transformation process quite differently (Section 4.3). It forms constraint
targets in a separate analysis before it attempts to infer any constraints. Because of this separation,
the inferences that produce transformed queries are carried out in a data-directed rather than a
goal-directed manner. That is, QUIST reasons forward from known constraints without having a precise
goal for that reasoning. This is not to say, however, that QUIST does not identify which constraints
would be desirable. This is exactly what QUIST does when it identifies constraint targets (its
so-called planning step, Section 4.3.1). Rather, QUIST uses goal information to cut off unpromising
lines of inference.
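
The combination of forward reasoning with pruning by constraint targets can be sketched as follows. This Python fragment is a minimal illustration under simplifying assumptions, not the QUIST code: the function infer and its rule encoding are hypothetical, all constraints are lower bounds of the form A > b, and targets is taken to be a set of attribute names fixed before inference begins.

```python
from typing import Dict, List, Set, Tuple

# (ante_attr, ante_min, cons_attr, cons_min) encodes the rule
# (ante_attr > ante_min) -> (cons_attr > cons_min).
Rule = Tuple[str, float, str, float]


def infer(rules: List[Rule], query_bounds: Dict[str, float],
          targets: Set[str]) -> Dict[str, float]:
    """Reason forward from the query's bounds, pruning by constraint targets."""
    bounds = dict(query_bounds)
    # Goal information is used only here, to discard unpromising rules up front.
    promising = [rule for rule in rules if rule[2] in targets]
    agenda = list(bounds)                  # attributes whose bounds are new
    while agenda:
        attr = agenda.pop()
        for rule in list(promising):
            ante_attr, ante_min, cons_attr, cons_min = rule
            if ante_attr != attr or bounds[attr] < ante_min:
                continue                   # not applicable (yet)
            promising.remove(rule)         # applicable rules are used up
            if cons_min > bounds.get(cons_attr, float("-inf")):
                bounds[cons_attr] = cons_min   # effective: a stronger constraint
                agenda.append(cons_attr)
    return bounds


EXAMPLE_RULES: List[Rule] = [
    ("A1", 40, "A2", 200),
    ("A1", 30, "A2", 50),
    ("A1", 60, "A2", 300),
    ("A1", 25, "A3", 100),   # concludes on A3, which is not a target
    ("A2", 150, "A4", 10),   # hypothetical extra rule to show chaining
]
RESULT = infer(EXAMPLE_RULES, {"A1": 50, "A2": 100}, {"A2", "A4"})
```

In this run the bound on A2 is strengthened to 200, which in turn triggers the rule that concludes a bound on A4, while the rule that concludes a bound on A3 is never tested because A3 is not a target.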


The result is that QUIST’s control strategy is much less elaborate than the one proposed for
KBQP. In Section 5.2, we report on experiments that show QUIST’s control strategy to be effective,
at least for a limited sample of QUIST’s class of relational database queries. In Section 5.3, we argue
further that the control strategy remains effective under reasonable assumptions about the complexity
of the semantic rules and the growth of the database.

The KBQP approach to control reflects the philosophy that determining a suitable query
transformation is a very complex problem and that the possible improvement in the query warrants
an elaborate and possibly expensive analysis. QUIST’s approach reflects a different philosophy:
keep the analysis simple at the cost of missing some desirable transformations. At first glance,
KBQP’s approach seems more general. Yet QUIST’s approach seems appropriate where, as in the
case of its particular class of relational database queries, storage and access conventions are
well established and cost factors are well understood. The point is that QUIST can make reasonable
assumptions about the frequency and the consequent importance of certain kinds of constraints
(namely, those on single attributes, particularly indexed attributes). Its knowledge base and its
control strategy are based on these assumptions. It may be that as other classes of queries and other
means of storing and accessing data are better understood, new QUIST-style heuristics can be
developed and QUIST’s approach will prove effective. Which philosophy is more appropriate for
semantic query optimization in general can only be determined by further research. However, it can
be said that:

There  is  evidence  that  a  simple  control  strategy  that  uses  forward  reasoning  limited  by  a  set 
of  previously  computed  constraint  targets  is  effective  for  semantic  query  optimization  in 
attribute/constraint  relational  queries. 

KBQP  is  a  design  proposal  that  would  probably  require  new  machine  architectures  for  cost- 
effective  implementation.  By  contrast,  QUIST  has  been  implemented  and  tested  on  a  range  of 
queries.  It  builds  explicitly  on  assumptions  and  models  of  contemporary  research  in  query 
optimization  for  relational  databases  as  implemented  on  current  generation  serial  architectures. 

We  can  summarize  the  relationship  of  semantic  query  optimization  to  the  methods  we  have  called 
conventional  query  optimization  as  follows: 

Semantic query optimization makes it possible to achieve substantial improvements in the
efficiency  of  processing  that  are  not  achievable  by  conventional  techniques.  At  the  same 
time,  though,  semantic  query  optimization  can  be  viewed  as  extending  the  usefulness  of 
conventional  methods  in  the  sense  that  the  purpose  of  producing  semantically  equivalent 
queries  is  to  create  new  opportunities  to  apply  conventional  query  optimization  techniques. 

Finally,  we  should  note: 

The development of semantic query optimization demonstrates the fruitfulness of
investigating certain database problems from the point of view of artificial intelligence
research.

The development of semantic query optimization is part of a growing awareness of this point
[Brodie81] that is highly significant for database research.

6.2  Significance  for  artificial  intelligence  research 

As formulated in the QUIST system, semantic query optimization is significant in two major
respects. First, it suggests a new problem reformulation approach to the task of producing a “good”
plan when there already is an existing planning program to produce correct plans. Second, it
provides an example of intelligent database mediation by providing intelligent assistance in the best
use of database resources. In Section 6.2.1, we identify a conventional query optimizer as a planning
program. We discuss recent research in planning and problem solving in which the issues of
efficiency and explicit control of problem solving emerge. Finally, we contrast QUIST’s approach on
these issues with the approach taken in other planning systems. In Section 6.2.2, we define the
database mediation task and note how QUIST has supplied one part of the desired function.


6.2.1  The  reformulation  of  problems  for  better  solutions 

Given a database query stated in logical terms, the problem of query optimization is to specify an
efficient way to process that query in the physical database. That is to say, the problem of query
optimization is exactly what is referred to in artificial intelligence research as problem-solving.
Problem-solving is the determination of a sequence of actions to satisfy a goal. In query
optimization, the goal is to obtain some data or to check the truth of some assertion. The actions
through which the goal can be satisfied are operations in the physical database such as segment scans
and indexed scans (Section 2.3).

The resemblance between a query optimizer and an artificial intelligence problem-solving program
is illustrated by the System R query optimizer. As noted in Section 2.4, System R’s optimizer
analyzes the processing of an n-relation query as a sequence of processing 2-relation queries. Thus,
each 2-relation query can be regarded as an abstract step in the plan to perform the desired retrieval.
One of the main tasks of the optimizer is to pick the best way to carry out each 2-relation query.
There may be many ways to do this; in fact, [Yao79] describes a model that can generate 339 different
methods to carry out a 2-relation query. Thus, the optimizer must refine the abstract step in the best
available way. The refinement of abstract plan steps is a fundamental part of all recent planning
programs whether they are based on hierarchical planning (NOAH [Sacerdoti77]), best-first search
(LIBRA [Kant79]), or orthogonal planning (MOLGEN [Stefik80]).

The other task of the System R optimizer is to choose the best sequence in which to perform the
2-relation queries. Sequencing and refinement are related tasks. For example, some processing steps
compute a result in an order that differs from the original order of the data from a given relation.
Any index on that relation is no longer usable for later steps. Hence, certain refinement choices are
lost. Conversely, a single refinement choice for a particular step may so dominate the alternatives as
to force the step to be performed early in the sequence in order to avoid possible invalidation. This
interaction between sequencing and refinement is also seen in programs like LIBRA
[Kant79].

In  the  QUIST  system,  it  is  assumed  that  a  conventional  query  optimizer  is  available  to  carry  out 
the  refinement  of  a  logically  stated  query  into  a  plan  for  execution  in  the  physical  database.  What  is 
significant  about  QUIST  is  the  following: 

A  semantic  reasoner  can  be  applied  as  a  preplanner  that  can  result  in  the  production  of 
better  plans  by  its  associated  planning  program,  without  complicating  the  planning  program 
itself. 

To  understand  the  significance  of  this  for  planning  problems  in  general,  let  us  review  some  recent 
planning  programs  in  more  detail.  The  review  focusses  on  two  issues:  the  role  of  efficiency 
considerations  in  planning,  and  the  control  of  the  planning  process  itself. 

The PEGASUS program of Sproull [Sproull77] was one of the first planning programs to address
efficiency issues directly ([Garvey76] provides another example). Sproull’s chief concern was to
integrate the symbolic planning methods of artificial intelligence with the considerations of utility
developed in decision theory. The basic approach taken by PEGASUS was to conduct a search of
plan alternatives using a utility function to measure the promise of partially completed plans. The
utility function did more than this, however. It also provided the basis for judging the relative value
of further planning, of obtaining more information about the (uncertain) environment, and of
carrying out proposed plan steps. Thus, the PEGASUS planner controlled its own activities using the
same utility functions it employed to select the best plan. The overall goal of PEGASUS was to
achieve optimal behavior measured in terms of the combined utility of the execution of the completed
plan and of the planning process itself.

Kant’s LIBRA program [Kant79] also considered efficiency explicitly. Its goal was to take a
high-level description of a program and to transform it into an efficient program that could actually be
executed. Knowledge about how to transform a program was contained in coding rules developed by
Barstow [Barstow79]. LIBRA’s task was to decide which of possibly many coding rules to apply at
any point. It used efficiency rules to do this (and in this respect is very much like QUIST). The
efficiency rules reflected both heuristic and analytical estimates of the cost of alternative refinements.
In addition, LIBRA used resource allocation rules to decide which part of the program description to
refine first. Choosing to refine some parts before others could greatly reduce the number of
refinement alternatives that had to be considered.

The MOLGEN program of Stefik [Stefik80] advanced the notions of metaplanning and constraint
posting. The idea of metaplanning is that the planning process itself should be controlled by a similar
planning process (a similar notion appears in [Hayes-Roth78]), responsible for such activities as focus
of attention at the planning level. Constraint posting is the idea that decisions should be postponed
until the constraints arising from commitments or guesses elsewhere in the developing plan are
propagated. This reduces the number of alternatives that must be considered.

These  programs  illustrate  several  major  themes  of  current  research  in  problem  solving  and 
planning: 

•  Planning  proceeds  by  adding  constraints  to  a  partially  completed  plan. 

•  The  programs  reason  explicitly  about  the  control  of  the  planning  process. 

•  Decisions about how to refine a particular segment of a plan are intermixed with
decisions about what the planning program should do next.

•  In many cases (PEGASUS and LIBRA, for example) it is desired to produce not just a
correct plan but a “good” plan, and furthermore, it is desired that the planner itself be
efficient.

In  query  optimization,  including  semantic  query  optimization,  we  are  obviously  concerned  with 
the  quality  of  the  final  plan,  as  measured  by  its  efficiency.  We  are  also  concerned  with  the  efficiency 
of  the  planning  process.  Where  QUIST  differs  from  contemporary  planning  programs  is  in  its 
approach  to  finding  an  efficient  plan.  Rather  than  integrating  decisions  about  the  planner’s  focus  of 
attention  with  decisions  about  the  choice  of  refinement,  including  those  choices  that  bear  on 
efficiency,  QUIST  moves  considerations  of  efficiency  into  a  preplanning  step.  In  this  step, 
constraints are added to the statement of the problem itself. The constraints are added not as the
result  of  elaboration  of  a  plan  step,  but  rather  for  the  express  purpose  of  having  the  planner  work  on 
a  new  but  equivalent  problem  for  which  a  more  efficient  plan  may  be  generated.  In  other  words: 


A  preliminary  reformulation  of  a  problem  statement  can  be  used  to  achieve  a  more  efficient 
solution  to  the  problem,  thereby  avoiding  explicit  and  possibly  costly  analysis  of  efficiency 
factors  during  the  actual  process  of  producing  the  solution. 

The  result  is  that  the  conventional  query  optimizer,  viewed  as  a  planner,  can  be  much  simpler  than 
it  would  have  to  be  if  it  tried  to  add  new  constraints  to  the  plan  in  order  to  make  the  plan  more 
efficient. 

Is  it  really  necessary  to  simplify  the  planner  in  this  manner?  Both  Sproull  and  Kant  have  claimed 
that  their  systems  not  only  produce  efficient  plans  but  do  so  with  an  efficient  planner.  In  fact,  despite 
some  investigation  of  the  issue,  the  cost  of  planning  is  not  a  crucial  factor  in  PEGASUS’s  travel 
planning domain nor in LIBRA’s program synthesis domain. That is not to say that an integrated
control strategy may not be appropriate. It does suggest that further investigation is needed to
determine where that strategy is worthwhile. In any event, both PEGASUS and LIBRA work with
essentially fixed problem statements; new constraints enter only in the refinement to executable
plans. (Interestingly, the planner reported in [Hayes-Roth78] does not have a fixed problem
statement; the planner is free to choose which of many tasks to perform. The “goodness” of the plan
is loosely related to how many tasks can be carried out using the completed plan.)

The  separation  between  problem  reformulation  and  problem  solving  raises  an  issue  that  is  not 
present in typical planning programs: how is problem reformulation controlled? In QUIST, problem
reformulation  (the  semantic  transformation  of  the  query)  is  controlled  by  means  of  the  constraint 
target  list  (Section  4.3.1).  The  constraint  targets  are  determined  from  knowledge  about  the  possible 
opportunities  for  finding  less  expensive  ways  to  search  the  files  involved  in  the  query.  In  the 
terminology  of  planning: 

The  process  of  reformulation  of  the  problem  for  the  sake  of  efficiency  can  be  guided  by 
knowledge about the cost of processing alternative refinements of abstract plan steps.

That is, there is a two-way flow of information. Not only does problem reformulation change the
class of possible plans to include more efficient plans, but also the information about the cost of plan
operators  that  the  planner  uses  can  be  abstracted  to  guide  the  reformulation  process. 

To summarize, then, semantic query optimization as formulated in the QUIST system offers a new
method for achieving a “good” solution to a problem when a method for finding correct solutions
already exists. The new method consists of reformulating the statement of the problem into an
equivalent form for which better solutions may exist. The process of reformulating the problem
statement is controlled by using an approximate model of the kinds of solutions produced by the
associated  problem  solver. 


6.2.2  Intelligent  database  mediation 

A  user  who  wishes  to  access  a  database  in  order  to  solve  a  problem  faces  several  difficulties.  For 
one thing, the user may not know what information is contained in the database. For another, he
may not know what concepts the database uses in general and what terminology is used to refer to
them. Even if the user understands the database’s structure and terminology, he may not know how
they relate to his own concepts and terms for the problem domain.

In conventional database installations, the user must either puzzle out these problems on his own,
or else he has recourse to the services of a database analyst or liaison. The analyst mediates between
the database resources and the user solving a problem. The analyst applies knowledge both of the
problem domain and of the capabilities and limitations of the database to pose the most effective and
easily processed queries that can help solve the original problem. The analyst supplies certain
knowledge about the database which the user lacks in order to make the most effective use of the
database. Of course, the analyst must know enough about the problem domain in order to do this
sensibly.

With the development of interactive query systems, it is expected that average users will interact
more  directly  with  databases,  without  the  aid  of  a  database  analyst.  It  is  clear,  however,  that  better 
facilities  must  be  created  to  perform  some  of  these  intelligent  database  mediation  functions 
automatically. 

Some  aspects  of  intelligent  database  mediation  have  been  explored.  McLeod’s  IFAP  (Interaction 
Formulation  Advisor  Prototype)  [McLeod78]  supplies  knowledge  of  the  classes  of  entities  known  to 
the  database  about  which  a  user  can  pose  queries.  In  the  LADDER  system  [Hendrix78],  a  user’s 
natural  language  query  is  transformed  into  a  retrieval  language  query  to  the  appropriate  network  site 
and database. That is, one of LADDER’s functions is to supply knowledge of the distribution of data
among  sites  and  databases,  knowledge  that  the  user  lacks. 

Semantic  query  optimization  is  significant  as  an  application  of  artificial  intelligence  methods  to  one 
aspect  of  the  intelligent  database  mediation  problem: 

As  an  intelligent  database  mediator,  a  semantic  query  optimization  system  employs 
detailed  knowledge  of  semantic  constraints  on  the  data  and  detailed  knowledge  of  the 
physical  organization  of  the  database,  knowledge  that  a  user  should  not  be  expected  to 
know  or  to  be  able  to  use. 

In  addition, 

Semantic  query  optimization  is  the  first  effort  to  apply  semantic  reasoning  to  the  task  of 
providing  efficient  access  to  pre-existing  computer  resources. 

We  are  not  claiming  that  the  present  research  has  discovered  the  problem  of  intelligent  database 
mediation  nor  that  it  has  devised  entirely  new  solutions  to  that  problem.  Rather,  the  present  research 
should  serve  to  encourage  additional  applications  of  artificial  intelligence  techniques  to  database 
mediation  and  other  database  problems. 

In the future, we will want computer systems to be increasingly knowledgeable not just about the
answers to specific questions, but also about the range of knowledge sources which they can access and
the ways in which those sources can be used. The research reported here is a step in that direction.

6.3  Limitations  and  directions  for  future  research 

Database  retrieval  is  a  very  important  activity.  Semantic  query  optimization  holds  the  promise  of 
substantial  improvements  in  this  activity.  Therefore,  it  is  worthwhile  examining  how  the  ideas 
advanced in this research are limited, and how their future usefulness might be extended. We are
particularly concerned with how semantic query optimization can be extended to other data models
and system architectures, how additional kinds of semantic knowledge can be employed for efficient
processing of queries, and how methods to control semantic query optimization may have to be
extended.


6.3.1  Data  models  and  database  architectures 

The QUIST system operates in the context of a subclass of the relational model of data which we
have referred to as an attribute/constraint model (Section 4.2.1). We have indicated that this includes
an important class of queries. Yet, it may be desirable to extend the system so that it is relationally
complete [Codd71]. To do this, it would be necessary to drop the assumption that there is a single
logical path between any two relations. This would require a somewhat more complicated
representation of semantic rules, because the logical path between relations would have to be
specified. Whether the extra complexity would be merited by the frequency of queries outside the
range now covered by QUIST would have to be studied. The difficulty would be to retain the
potential improvements of semantic transformations without adding the complexity of a
general-purpose theorem prover.

Our  research  centered  on  the  relational  model  because  that  model  has  been  the  focus  of  attention 
of  much  recent  research  on  query  processing.  However,  semantic  reasoning  can  certainly  be  applied 
in the context of other data models. For example, the principle of “pushing a constraint up a
hierarchy” (Section 4.4.1.3) certainly makes sense in a hierarchical or network database.

We also adopted a conventional model of data storage and access (Section 2.3). This model, or
models like it, is the basis for most research in query optimization. However, there is growing interest
in query optimization in distributed databases ([Epstein78]) and in unconventional database
machines ([Shaw80]). More generally, there is recent research ([Lenat79], [Katz80]) that extends the
ideas of Wiederhold ([Wiederhold77]) on the notion of the binding of semantic knowledge to data
structures. The aim of this research is to develop abstract descriptors and rules with which to reason
about the ease or difficulty of realizing the physical counterpart to a logical expression. If this effort
is successful, it would be an appropriate vehicle to generalize the heuristics of QUIST to apply to
multiple databases and unconventional architectures.


6.3.2  Semantic  knowledge 

The  semantic  knowledge  used  by  QUIST  involves  constraints  on  particular  values  stored  in  the 
database. However, there are other kinds of constraints that could be used for semantic query
optimization. 

Cardinality  constraints  specify  the  minimum  or  maximum  number  of  individuals  in  some  entity 
class  that  can  be  associated  with  an  individual  in  some  other  entity  class  by  means  of  a  particular  type 
of relationship. An example is: “every freshman and sophomore must have at least two faculty
advisors.” Dependence constraints can be viewed as cardinality constraints in which at least one
related entity must exist. Dependence and cardinality constraints are particularly significant in terms
of the structural integrity of databases [ElMasri80b]. For example, the constraint that “every manager
must manage exactly one department” regulates updates and deletions of manager and department
entities; the deletion of a department entity forces the deletion of related manager entities, and no
manager entity can be inserted for which the designated related department entity does not exist in
the database. However, the constraint does not determine which managers can be related to which
departments. Among the structural constraints are the widely discussed “functional dependencies”
and “multivalued dependencies” [Ullman80].

To  see  how  cardinality  and  dependence  constraints  could  be  used  for  semantic  query  optimization, 
consider  the  constraint  “every  student  except  those  with  an  independent  study  major  has  at  most  two 
advisors.”  Suppose  the  query  is:  “what  are  the  name  and  faculty  rank  of  the  advisors  of  history 
majors?” One way to process the question would be to find each history major in turn, and then to
find  each  of  his  advisors  and  print  his  name  and  rank.  However,  we  know  that  there  are  at  most  two 
advisors  for  each  student,  so  the  search  for  a  given  student’s  advisors  can  stop  when  the  second 
advisor  has  been  found.  Notice  that  the  ability  to  exploit  the  constraint  on  the  number  of  advisors 
requires  a  different  control  strategy  than  QUIST’s.  Specifically,  it  requires  more  direct  control  of  the 
query  processor  itself  so  that,  for  instance,  a  limit  on  the  number  of  hits  from  some  file  can  be  set  and 
reset as needed. By contrast, QUIST works entirely at the level of transforming the “surface level” of
queries.  The  only  thing  that  the  semantic  optimization  component  passes  down  to  the  query 
processor  is  a  query,  and  not  any  instructions  on  how  to  process  it. 
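
The early-termination idea can be sketched as follows. This is only an illustration and not part of
QUIST: the sample records, the function name find_advisors, and the max_hits parameter are all
hypothetical, standing in for a query processor whose hit limit can be set per lookup. The limit
would be set only for students covered by the constraint (here, history majors rather than
independent study majors).

```python
# Hypothetical stand-in for an advisor file: (student, advisor_name, advisor_rank).
ADVISES = [
    ("Adams", "Lee", "Associate"),
    ("Baker", "Ortiz", "Full"),
    ("Adams", "Chen", "Full"),
    ("Clark", "Lee", "Associate"),
    ("Baker", "Kim", "Assistant"),
]


def find_advisors(student, records, max_hits=None):
    """Scan records for a student's advisors.

    Returns (hits, examined): the matching (name, rank) pairs and the number
    of records read. When max_hits is given, the scan stops at that many hits.
    """
    hits = []
    examined = 0
    for rec_student, name, rank in records:
        examined += 1
        if rec_student == student:
            hits.append((name, rank))
            if max_hits is not None and len(hits) >= max_hits:
                break  # the cardinality constraint says no further hits exist
    return hits, examined
```

With max_hits=2 the scan for "Adams" reads three records instead of five and returns the same two
advisors. The saving comes from a limit passed to the processor, not from a change to the query
text, which is why it falls outside QUIST's query-to-query transformations.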

Another  kind  of  semantic  knowledge  is  what  can  be  termed  approximate  knowledge,  knowledge 
that  is  probabilistic  or  about  which  there  is  some  uncertainty.  It  includes  the  heuristics  or  rules  of 
thumb  that  help  experts  to  reason  effectively  in  their  area  of  expertise.  Approximate  knowledge 
could  be  applied  to  semantic  query  optimization  by  using  a  somewhat  different  strategy  than  QUIST 
uses, one that is itself heuristic in nature. Suppose it is known that most supertankers are registered
in  Panama,  Liberia,  or  Greece,  and  suppose  a  query  asks  for  the  names  of  three  supertankers  carrying 
crude  oil  to  Italy.  In  that  case,  it  is  likely  that  the  names  of  three  qualifying  supertankers  can  be 
found  merely  by  examining  those  registered  in  Panama,  Liberia,  or  Greece.  If  the  registration 
information is well supported in the database (say, by an index) and if there are indeed three
supertankers that are registered in one of those countries and are currently carrying crude oil to Italy,
then it is proper and effective to transform the question so that it references the country of
registration.

This strategy offers no guarantee that the substituted query gives the same answer as the original
query. Therefore, a more sophisticated system is needed to apply this new strategy effectively. For
instance, suppose there are 100 supertankers. If we know that “most” supertankers are Liberian, then
it seems likely that questions that request 5 supertankers can be answered merely by referencing
Liberian tankers. However, it may not be effective to process questions that request 95 supertankers
by looking first only at Liberian tankers. If the required number of supertankers is not found
among the Liberian ones, the search must be renewed among all supertankers. The system must be
sophisticated enough to determine when it is probably worthwhile adding constraints to the query
based  on  approximate  knowledge. 
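
A minimal sketch of this try-the-narrowed-query-first strategy is given below. It is an assumption
about how such a strategy might look, not a description of QUIST. The ship records, the field
names, and the function find_supertankers are invented for illustration.

```python
# Hypothetical ship records; all names and values are invented for illustration.
SHIPS = [
    {"name": "A", "flag": "Liberia", "cargo": "crude oil", "destination": "Italy"},
    {"name": "B", "flag": "Panama", "cargo": "crude oil", "destination": "Italy"},
    {"name": "C", "flag": "Norway", "cargo": "crude oil", "destination": "Italy"},
    {"name": "D", "flag": "Liberia", "cargo": "grain", "destination": "Italy"},
    {"name": "E", "flag": "Greece", "cargo": "crude oil", "destination": "Spain"},
]


def qualifies(ship):
    """The original query's restriction: carrying crude oil to Italy."""
    return ship["cargo"] == "crude oil" and ship["destination"] == "Italy"


def find_supertankers(ships, wanted, likely_flags):
    """Return (names, strategy) for up to `wanted` qualifying ships.

    The narrowed search over likely_flags is tried first; if it does not
    yield enough ships, the search is renewed over all ships.
    """
    narrowed = [s["name"] for s in ships
                if s["flag"] in likely_flags and qualifies(s)]
    if len(narrowed) >= wanted:
        return narrowed[:wanted], "narrowed"
    # The heuristic constraint failed, so its cost was wasted: start over.
    everything = [s["name"] for s in ships if qualifies(s)]
    return everything[:wanted], "full"
```

Asking for two ships succeeds on the narrowed search, while asking for three forces the fallback,
so the cost of the first pass is paid for nothing. A real system would need some estimate of that
risk before deciding to add the heuristic constraint.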

Finally, we consider the use of knowledge not about the semantics of the domain but about the
relationship  between  concepts  defined  within  the  database  itself  (the  familiar  notion  of  logical  views). 
For  instance,  suppose  that  in  a  university  database  the  predicate  INSTRUCT  is  a  derived  relationship 
between  professors  and  students  defined  in  terms  of  fundamental  predicates  TEACH  and 
ENROLLED-IN  as  follows: 

∀p,s INSTRUCT(p,s) ≡ ∃c.(TEACH(p,c) ∧ ENROLLED-IN(s,c))

This  says  that  a  professor  p  instructs  a  student  s  if  and  only  if  there  is  some  class  c  that  the 
professor  teaches  and  in  which  the  student  is  enrolled.  Let  us  now  consider  the  query: 

“What professors instruct all the students whom they advise?”


which  we  render  as: 

{p | ∀s. ADVISE(p,s) → INSTRUCT(p,s)}

The strategy for this query would be to eliminate all professors who advise more students than they
instruct, for in that case, they certainly can’t instruct every student whom they advise. We can
conservatively assume that each student is enrolled in no more than one course taught by any
professor. Then for every professor who satisfies the query, it must also be true, from the definition
of “instructs”, that he instructs no more students than the product of the maximum number of students
in any one class and the maximum number of classes taught by any professor. Let Cnt(S) stand for
the number of items in set S, and Max(x,F(x)) stand for the upper bound on function F(x) for any
value of x. In addition, let I(p) be the total number of students instructed by a professor, p. That is:

∀p I(p) = Cnt({s | INSTRUCT(p,s)})

Then the conservative upper bound can be expressed as:

∀p I(p) ≤ Max(c, Cnt({s | ENROLLED-IN(s,c)})) * Max(p, Cnt({c | TEACH(p,c)}))

If,  for  instance,  there  is  a  maximum  enrollment  of  20  students  in  any  course,  and  a  maximum 
teaching  load  of  3  courses  per  professor,  then  we  can  eliminate  from  consideration  any  professor  who 
advises more than 60 students. This conservative bound can be tightened as more information is
gathered  during  query  processing.  Thus,  if  professor  X  teaches  2  courses,  one  with  12  students  and 
one  with  18,  he  can  be  eliminated  if  he  advises  more  than  30  students,  rather  than  the  conservative 
bound of 60. As this strategy uses cardinality constraints, it relies on more detailed control of query
processing than QUIST does.
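
The arithmetic of the two bounds can be checked with a short sketch. The relation contents, the
professor name "X", and the function names are hypothetical. Only the figures 20, 3, 12, and 18
come from the example above.

```python
# Hypothetical contents of TEACH and ENROLLED-IN for professor "X":
# two courses, one with 12 students and one with 18.
TEACH = {("X", "c1"), ("X", "c2")}
ENROLLED_IN = ({(f"s{i}", "c1") for i in range(12)}
               | {(f"s{i}", "c2") for i in range(12, 30)})


def conservative_bound(max_enrollment, max_teaching_load):
    """Bound on I(p) for every professor, from the declared maxima alone."""
    return max_enrollment * max_teaching_load


def tightened_bound(professor, teach, enrolled_in):
    """Bound on I(p) for one professor, from his actual class sizes."""
    classes = {c for p, c in teach if p == professor}
    return sum(1 for _, c in enrolled_in if c in classes)


def can_eliminate(advisee_count, bound):
    """A professor advising more students than he can instruct fails the query."""
    return advisee_count > bound
```

A professor with 45 advisees survives the conservative bound of 60 but is eliminated once the
per-professor bound of 30 is known. Computing the tighter bound requires reading TEACH and
ENROLLED-IN data during processing, which is the extra control the text refers to.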

6.3.3 Control of semantic query optimization

In  Sections  6.1.2  and  6.2.1,  we  argued  the  merits  of  organizing  a  semantic  query  optimization 
system  as  a  preplanner.  That  is,  we  advocated  performing  all  semantic  transformations  of  the  query 
prior to generating the sequence of steps for actually carrying out the query in the physical database.

However,  we  saw  an  example  in  the  last  section  in  which  knowledge  about  the  distribution  of 
current  data  values  could  provide  cost-reducing  constraints.  In  this  section,  as  in  Section  6.2.1,  we 
recommend  that  further  research  be  conducted  on  the  question  of  under  what  conditions  control  of 
semantic  query  optimization  should  be  integrated  with  the  processing  of  the  query  itself.  As  further 
justification  for  such  research,  we  offer  two  more  examples  of  data-dependent  semantic  query 
optimization. 

Consider,  for  example,  a  database  that  contains  data  about  ships  and  their  movements.  Assume 
that  the  most  frequent  queries  concern  the  current  status  of  American  ships,  so  that  a  small  file 
containing  duplicate  information  about  their  most  important  current  voyage  attributes  is  maintained. 
Whenever  a  position  report  is  received  on  an  American  ship,  both  the  regular  file  and  the  duplicate 
"highlights"  file  are  updated.  Suppose  that  a  user  poses  the  query: 

“Where is the fastest tanker?”

If  the  nationality  of  the  ship  is  stored  with  its  speed  and  shiptype,  then  we  can  check  whether  the 
ship  is  American.  If  it  is,  then  we  only  have  to  look  for  its  position  in  the  small  file  of  American  ships. 
Otherwise,  we  have  to  look  through  the  larger  file  of  position  reports  for  all  ships.  If  the  fastest 
tanker  happens  to  be  American,  then  in  effect  the  original  query  can  be  transformed  into: 


“Where is the fastest American tanker?”

But  this  transformation  is  only  supported  in  the  current  state  of  the  database.  There  is  no  integrity 
rule  prohibiting  the  insertion  of  another  record  representing  a  faster  tanker  of  another  nation.  Thus, 
this  transformation  is  inherently  dependent  upon  the  current  state  of  the  database. 
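
The dependence on the current database state can be made concrete with a small sketch. The file
layouts, ship names, positions, and the function locate_fastest_tanker are all hypothetical. The
sketch only illustrates that the choice of file is made after looking at the data.

```python
def locate_fastest_tanker(ships, highlights, positions):
    """Return (position, file_used) for the fastest tanker.

    ships:      records with name, type, speed, and nation
    highlights: small file of current positions of American ships
    positions:  large file of current positions of all ships
    """
    tankers = [s for s in ships if s["type"] == "tanker"]
    fastest = max(tankers, key=lambda s: s["speed"])
    if fastest["nation"] == "US":
        # Valid only in the current state: no integrity rule guarantees it.
        return highlights[fastest["name"]], "highlights"
    return positions[fastest["name"]], "positions"


# Hypothetical data, invented for illustration.
SHIPS = [
    {"name": "Alpha", "type": "tanker", "speed": 18, "nation": "US"},
    {"name": "Bravo", "type": "tanker", "speed": 15, "nation": "Norway"},
    {"name": "Carla", "type": "freighter", "speed": 25, "nation": "US"},
]
POSITIONS = {"Alpha": "36N 020W", "Bravo": "58N 004E", "Carla": "40N 070W"}
HIGHLIGHTS = {"Alpha": "36N 020W", "Carla": "40N 070W"}
```

With the data shown, the small file suffices. Inserting a faster foreign tanker changes the
outcome without violating any integrity rule, so the transformed query cannot be produced in a
preplanning step that sees only the rules.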

The  preceding  example  and  the  one  in  the  last  section  do  not  actually  use  any  rules  about  the 
application  domain;  they  only  use  relationships  internal  to  the  database.  Yet,  the  current  contents  of 
the  database  can  affect  the  application  of  a  domain  rule  as  well,  as  the  following  example  illustrates. 
Assume we have a simple relational database:

SHIPS: (Shipname Shiptype Length Draft Capacity)

PORTS: (Portname Country Depth Facilitytype)

VISITS: (Ship Port Date Cargo Quantity)

and the following two semantic integrity rules based on domain knowledge:

Rule R1. “A ship can visit a port only if the ship’s draft is less than the
channel depth of the port.”

Rule R2. “Only liquefied natural gas (LNG) is delivered to ports that are
specialized LNG terminals.”

Assume  that  each  relation  is  implemented  as  a  single  file  on  its  own  data  pages.  The  VISITS  file 
has  a  clustered  index  on  Cargo.  Now  consider  a  query  that  requests  the  ships,  dates,  cargoes  and 
quantities  of  visits  to  the  port  of  Zamboanga.  According  to  our  semantic  query  optimization 
heuristics,  it  is  desirable  to  infer  a  constraint  on  Cargo  from  the  given  constraint  on  Port. 

Imagine  that  instead  of  performing  semantic  transformations  in  a  preplanning  phase,  there  is 
integrated  control  of  semantic  transformation  and  data  retrieval.  Control  of  the  process  of  inferring 
cost-reducing  constraints  can  then  be  viewed  as  control  of  the  moves  in  a  space  of  constraints  on 
attributes.  Constraints  can  be  moved  either  by  applying  a  rule,  by  retrieving  items  restricted  on  one 
attribute  and  observing  their  values  on  other  attributes,  or  by  matching  constraints  on  attributes 
defined  on  the  same  underlying  set  of  entities. 

Continuing  the  example,  starting  with  a  constraint  on  the  Port  attribute  of  VISITS,  new  constraints 
can  be  found  by  retrieving  from  VISITS  or  by  assigning  the  value  “Zamboanga”  to  the  Portname 
field  of  PORTS.  The  first  choice  is  rejected  because  the  objective  is  to  reduce  the  cost  of  that  very 
retrieval.  With  a  constraint  on  Portname  in  PORTS,  a  retrieval  from  PORTS  can  be  performed.  In 
this  case,  just  a  single  record  will  be  obtained  because  Portname  is  the  unique  identifier  in  that  file. 
With  appropriate  access  methods,  such  as  hashing,  the  retrieval  will  be  very  inexpensive. 

When the PORTS record for “Zamboanga” has been obtained, rules R1 and R2 may apply. If rule
R2  applies,  that  is,  if  Zamboanga  is  a  specialized  liquefied  natural  gas  terminal,  then  a  strong 
constraint  will  be  obtained  on  the  goal  attribute  Cargo,  and  retrieval  from  VISITS  will  take  place  by 
means  of  an  indexed  scan  rather  than  by  means  of  a  more  expensive  sequential  scan.  If  the  data  on 
Zamboanga does not support that inference, then other inference paths beginning with rule R1 will
have  to  be  considered.  This  illustrates  the  possible  dependence  of  retrieval  planning  on  the  current 
contents  of  the  database. 
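
The control flow just described can be sketched as follows. This is a simplified illustration
under stated assumptions: the sample records, the encoding of Facilitytype as the string "LNG",
and the function names are invented, and the R1 inference path is replaced by a plain sequential
scan.

```python
from collections import defaultdict

# Hypothetical data; treating Zamboanga as an LNG terminal is an assumption.
PORTS = {
    "Zamboanga": {"facilitytype": "LNG"},
    "Naples": {"facilitytype": "general"},
}
VISITS = [
    {"ship": "Alpha", "port": "Zamboanga", "cargo": "LNG", "quantity": 500},
    {"ship": "Bravo", "port": "Naples", "cargo": "crude oil", "quantity": 900},
    {"ship": "Carla", "port": "Zamboanga", "cargo": "LNG", "quantity": 450},
    {"ship": "Delta", "port": "Naples", "cargo": "LNG", "quantity": 300},
]


def build_cargo_index(visits):
    """Stand-in for the clustered index on Cargo."""
    index = defaultdict(list)
    for visit in visits:
        index[visit["cargo"]].append(visit)
    return index


def visits_to_port(port_name, ports, visits, cargo_index):
    """Return (ship names, access method, records examined)."""
    port = ports[port_name]  # cheap keyed lookup on Portname
    if port["facilitytype"] == "LNG":
        # Rule R2 applies: every delivery to this port carried LNG.
        candidates, method = cargo_index["LNG"], "indexed scan on Cargo"
    else:
        # Simplification: fall back directly instead of trying rule R1.
        candidates, method = visits, "sequential scan"
    ships = [v["ship"] for v in candidates if v["port"] == port_name]
    return ships, method, len(candidates)
```

For Zamboanga only the three LNG records are examined. For Naples all four are read. Which plan is
chosen depends on the PORTS record that was retrieved, not on the rules alone.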

Whether  or  not  such  elaborate  control  is  worthwhile  is  certainly  open  to  question.  It  depends  in 
part  upon  what  kinds  of  processing  options  are  available;  it  seems  more  likely,  for  instance,  that  an 
integrated  strategy  makes  more  sense  in  a  distributed  database  with  redundant  files.  The  point  of 
these examples is simply to indicate the value of further research on this issue.

6.4 Conclusion

This  research  has  introduced  a  new  method  to  reduce  significantly  the  cost  of  processing  database 
queries.  The  method  uses  semantic  knowledge  that  is  otherwise  used  to  insure  the  validity  of 
database  entries.  It  applies  techniques  of  artificial  intelligence  to  the  problem.  At  the  same  time,  it 
suggests  a  new  approach  to  problem  solving  when  the  quality  of  the  desired  plan  is  important  and 
there  already  exists  a  generator  of  correct  plans.  This  approach  is  to  reformulate  the  problem 
statement  as  an  equivalent  problem  which  may  have  a  better  solution. 

To  be  useful  in  future  database  systems,  the  work  presented  here  must  be  extended  to  additional 
models  of  physical  data  storage  and  access  and  to  a  wider  range  of  logical  data  models.  Also, 
experience  is  needed  with  actual  database  systems  to  test  further  the  promising  results  obtained  under 
laboratory  conditions;  tests  of  query  processing  methods  are  generally  run  on  small  sets  of  invented 
examples,  but  this  is  not  a  suitable  practice  for  future  work.  Additional  research  is  needed  to 
investigate  when  a  problem  reformulation  strategy  can  be  applied  to  the  task  of  finding  good 
solutions  to  problems. 

Whatever the particular merits or shortcomings of semantic query optimization and the QUIST
system, the research presented here suggests the value of work at the intersection of database
management and artificial intelligence. These fields are important and exciting and have a great deal
to offer to each other.

Appendix A

The  QUIST  query  language 

A.1  Syntax  of  the  QUIST  query  language 

In the following description, the metasymbol “+” means one or more instances of the type so
designated, and the metasymbol “|” means a choice between the items it separates. In the actual
QUIST language, the tokens ONEOF and NOTONEOF are used instead of the symbols ∈ and ∉,
respectively, that are used in the examples throughout this report.

<query> ::= (<selections> <restrictions>)

<selections> ::= (<attribute>+)

<restrictions> ::= (<restriction>+)

<restriction> ::= (<attribute> <constraint>)

<constraint> ::= <string constraint> | <integer constraint>

<string constraint> ::= (ONEOF (<string>+)) | (NOTONEOF (<string>+))

<integer constraint> ::= (<interval>+)

<interval> ::= (([GT | GE] <integer>) ([LT | LE] <integer>)) | ((<comparator> <integer>))

<comparator> ::= GT | GE | LT | LE | EQ | NE

A.2 Semantic restrictions on the language

1.  An  attribute  can  appear  only  once  among  the  selections  or  among  the  restrictions. 

2. An integer-valued attribute can only be constrained by an integer constraint, and a string-
valued attribute only by a string constraint.

3. The intervals of an integer constraint must not conflict. This requirement is enforced as
follows. If the constraint is ((Comp1 Value1) ... (CompN ValueN)), then Value1 < Value2
< ... < ValueN. Furthermore, each GT/GE term that is not last in the list must be
followed by a LT/LE term (possibly with some intervening NE terms), and each LT/LE
term that is not last in the list must be followed by a GT/GE term (possibly with some
intervening EQ terms).
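
Restriction 3 can be read as a simple check over the flattened list of terms. The sketch below is
one literal reading of the rule, not QUIST's own code. The representation of a constraint as a
Python list of (comparator, value) pairs and the function name intervals_consistent are
assumptions.

```python
LOWER = {"GT", "GE"}  # comparators that open an interval
UPPER = {"LT", "LE"}  # comparators that close an interval


def intervals_consistent(terms):
    """Check restriction 3 on a list of (comparator, value) pairs."""
    values = [value for _, value in terms]
    # Value1 < Value2 < ... < ValueN
    if any(a >= b for a, b in zip(values, values[1:])):
        return False
    for i, (comp, _) in enumerate(terms):
        if comp in LOWER:
            expected, skippable = UPPER, {"NE"}
        elif comp in UPPER:
            expected, skippable = LOWER, {"EQ"}
        else:
            continue  # EQ and NE terms impose no ordering requirement here
        rest = [c for c, _ in terms[i + 1:]]
        j = 0
        while j < len(rest) and rest[j] in skippable:
            j += 1  # skip the permitted intervening terms
        # A term that is not last must reach a term of the expected kind.
        if rest and (j == len(rest) or rest[j] not in expected):
            return False
    return True
```

For example, the list GE 10, LE 20, GE 30, LE 40 passes, while GT 10, GT 20 and GE 20, LE 10 are
rejected. Under this literal reading a GT term followed only by NE terms is also rejected, since
it is not last and never reaches a LT/LE term. Whether QUIST treats that case the same way is not
stated in the text.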

Appendix B

QUIST  and  the  relational  calculus 


In  this  appendix,  we  show  that  the  queries,  semantic  equivalence  transformations,  and  integrity 
rules  of  QUIST  are  special  cases  of  relational  calculus  queries,  semantic  equivalence  transformations, 
and  integrity  rules,  respectively.  To  do  this,  we  indicate  how  to  determine  the  relational  calculus 
query that corresponds to a QUIST query. We also show how the process of semantic equivalence
transformation  in  QUIST  is  a  special  case  of  the  process  for  relational  queries. 

B.1  Generation  of  a  relational  calculus  query  from  a  QUIST  query 

A QUIST query consists of a qualification, which is a conjunction of constraints on database
attributes, and an output list, which is a list of attributes whose values are requested. Hence, a
QUIST query corresponds to an open query of the relational calculus, as defined in Chapter 3. A
QUIST  query  is  simpler  than  a  relational  query  chiefly  in  two  respects:  join  terms  need  not  be  stated 
explicitly,  and  only  conjunctions  of  attribute  constraints  can  be  expressed.  The  major  difference  in 
form  arises  from  the  assumption  in  QUIST  that  every  attribute  is  associated  with  a  single  virtual 
relation,  hence  no  relation  need  be  specified.  Relational  calculus  queries,  on  the  other  hand,  employ 
tuple  variables  that  range  over  explicitly  specified  database  relations. 

To  generate  the  relational  calculus  query  that  corresponds  to  a  QUIST  query,  it  is  therefore 
necessary to determine what real relations are involved in the query and what join terms are needed
to  link  them  together  properly.  Because  only  conjunctive  queries  are  permitted,  every  relation  in  the 
database  that  is  involved  in  the  query  plays  one  of  only  four  possible  roles: 

1.  some  of  its  attributes  are  constrained  but  none  are  in  the  output  list; 

2.  some  attributes  are  in  the  output  list  but  none  are  constrained; 

3. some attributes are constrained and some are in the output list;

4. its attributes are neither constrained nor designated for output, but it is joined between
two (or more) other relations.

A tuple variable must be generated for every relation that is involved in the query. If some
attributes of the relation are designated for output (Cases 2 and 3), then the variable appears in the
relational query target list. If not (Cases 1 and 4), then the variable appears as an existentially bound
variable within the qualification.

If some of the relation’s attributes are constrained (Cases 1 and 3), then it is necessary to generate a
restriction term in the corresponding tuple variable. The restriction term is the conjunction of the
restrictions  on  the  attributes  of  the  relation. 

A QUIST query does not mention any relations that are joined to others but are not otherwise
involved  in  the  query  (Case  4).  The  fact  that  these  relations  are  involved  is  determined  in  the  course 
of  generating  the  required  mappings  (join  terms)  among  the  other  involved  relations  (Cases  1, 2,  and 
3). 

It  is  fairly  simple  to  generate  the  mappings  among  the  query  relations  because  there  is  one  and 
only  one  logical  access  path  or  mapping  between  any  two  relations  in  the  database;  this  is  a 
distinguishing  characteristic  of  QUIST  and  of  other  attribute/constraint  relational  query  languages. 
Consequently,  we  can  choose  any  relation  involved  in  a  query  and  regard  it  as  the  root  of  a  tree  of 
relations.  Every  other  relation  in  the  query  can  be  reached  via  some  unique  path.  Indeed,  the 
structure of a query should be viewed as a subtree of QUIST’s virtual relation. The virtual relation,
defined  in  Chapter  4,  is  a  tree  structure  in  which  all  the  real  database  relations  are  linked  by  way  of 
uniquely  specified  sequences  of  joins. 
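Because the virtual relation is a tree, the path between any two relations can be found by climbing from each toward the tree's root and discarding the shared ancestors. The sketch below illustrates this under the assumption that the tree is stored as a child-to-parent map; the names `path_to_root` and `unique_path` and the relation names A to D are hypothetical.

```python
def path_to_root(parent, node):
    """List the relations from `node` up to the root of the tree."""
    path = [node]
    while parent[node] is not None:
        node = parent[node]
        path.append(node)
    return path


def unique_path(parent, a, b):
    """Return the single path of relations linking a to b in the tree."""
    up_a = path_to_root(parent, a)
    up_b = path_to_root(parent, b)
    # Drop shared ancestors, keeping only the lowest common one.
    while len(up_a) > 1 and len(up_b) > 1 and up_a[-2] == up_b[-2]:
        up_a.pop()
        up_b.pop()
    # up_a and up_b now end at the same relation; walk down the b side.
    return up_a + up_b[-2::-1]


# Hypothetical virtual relation: A is the root, B and C hang off A, D off B.
parent = {"A": None, "B": "A", "C": "A", "D": "B"}
route = unique_path(parent, "D", "C")
```

Each adjacent pair on the returned path corresponds to one prespecified join.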

B.2  The  generation  algorithm 

We  use  the  tree-structure  property  of  the  virtual  relation  to  generate  the  relational  query  from  the 
QUIST query. The major steps of the algorithm are:

•  Step  1.  Determine  directly  from  the  QUIST  query  which  database  relations  have  either 
constrained  or  output  attributes  (Cases  1, 2,  and  3). 

•  Step  2.  Designate  one  of  these  relations  as  the  root  of  the  tree  of  query  relations. 

•  Step  3.  Choose  a  relation  found  in  Step  1.  Generate  a  tuple  variable  for  it  and  link  it  up 
to  the  relations  that  have  already  been  chosen  by  generating  the  appropriate  join  terms. 
Generate  a  restriction  term  for  the  relation  if  necessary.  Repeat  until  all  such  relations 
have  been  chosen. 

•  Step  4.  Formulate  the  relational  query  from  the  tuple  variables,  restriction  terms,  and 
join  terms. 

We now describe the steps of the algorithm in more detail.

The first step finds all relations that are constrained or part of the output. Go through the restrictions
in the QUIST qualification. If the attribute in the current restriction is associated with a relation X
that has not yet been encountered, then add X to the set S_c of constrained relations. Next, go through
the QUIST output list. For each new relation X encountered, add X to the set S_out of output
relations. Finally, let S_all be S_c ∪ S_out, the set of all relations so far involved in the query.
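The first step amounts to two passes and a set union. A minimal sketch follows, assuming a dictionary that maps each attribute to the one relation it belongs to; the function name `involved_relations` and the schema in the usage lines are hypothetical.

```python
def involved_relations(attr_relation, restrictions, output_list):
    """Compute the constrained, output, and combined relation sets.

    attr_relation: maps each attribute name to its (single) relation
    restrictions:  list of (attribute, constraint) pairs from the qualification
    output_list:   list of attribute names whose values are requested
    """
    s_c = {attr_relation[attr] for attr, _ in restrictions}
    s_out = {attr_relation[attr] for attr in output_list}
    return s_c, s_out, s_c | s_out


# Hypothetical schema and query.
attr_relation = {"Type": "SHIPS", "Name": "SHIPS", "Kind": "CARGOES", "Port": "PORTS"}
s_c, s_out, s_all = involved_relations(
    attr_relation,
    restrictions=[("Type", ["tanker"]), ("Kind", ["oil"])],
    output_list=["Name", "Port"],
)
```

Bridge relations (Case 4) are deliberately absent from the combined set at this point; they surface only during linking.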

In the second step, we choose any relation X from S_all to be the root relation to which all the other
relations will be linked. We generate a tuple variable x for this relation. If X is in S_out, the set of
relations involved in the output, then variable x is placed in set Q_free, the set of variables that will be in
the target list; the subscript “free” conveys the idea that variables in the target list appear free in the
qualification of the relational query. Otherwise, x is placed in Q_∃, the set of variables that will be
existentially bound in the qualification. If X is in S_c, the set of relations involved in QUIST
constraints, then a restriction term P(x) is generated. P(x) is the conjunction of restrictions on
attributes associated with X. For example, if the restrictions (A_1 ∈ {“a” “b”}) and (A_2 ∈ [50 100))
are the QUIST constraints associated with relation X, then P(x) is given by:

P(x) = ((x.A_1 = “a”) ∨ (x.A_1 = “b”)) ∧ ((x.A_2 ≥ 50) ∧ (x.A_2 < 100)).
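The construction of a restriction term can be sketched as string generation. This is only an illustration under stated assumptions: a set constraint is passed as a Python list, a numeric range as a `(low, high)` tuple read as a half-open interval, and the connectives are spelled `or`, `and`, and `>=` in ASCII. The function name `restriction_term` is hypothetical.

```python
def restriction_term(var, restrictions):
    """Build the conjunction P(var) from a relation's QUIST restrictions.

    restrictions: list of (attribute, spec) pairs, where spec is either a
    list of permitted values or a (low, high) tuple for low <= value < high.
    """
    parts = []
    for attr, spec in restrictions:
        if isinstance(spec, list):
            # Set membership becomes a disjunction of equalities.
            alternatives = [f'({var}.{attr} = "{value}")' for value in spec]
            parts.append("(" + " or ".join(alternatives) + ")")
        else:
            # A range becomes a pair of comparisons.
            low, high = spec
            parts.append(f"(({var}.{attr} >= {low}) and ({var}.{attr} < {high}))")
    return " and ".join(parts)


p_x = restriction_term("x", [("A1", ["a", "b"]), ("A2", (50, 100))])
```

For the two restrictions of the example, `p_x` has the same shape as the P(x) given above.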

The  third  step  is  the  most  complicated  and  requires  further  discussion  and  some  terminology.  The 
key  concept  is  that  of  the  logical  access  path  or  mapping  between  any  two  relations  X  and  Y.  A 
QUIST  mapping  is  the  unique  expression  of  join  terms  by  which  the  two  relations  can  be  linked. 
Assume  that  tuple  variable  x  ranges  over  relation  X  and  tuple  variable  y  ranges  over  relation  Y.  Let 
M(x,y) (= M(y,x)) denote the mapping between X and Y in terms of these tuple variables. If X and
Y are “neighbors” in the sense that they are permitted to be joined together directly, then M(x,y) is
simply

M(x,y) = J(x,y)

where J(x,y) is the join term ((x.A) = (y.B)) for some prespecified attributes A and B of X and Y,
respectively.  This  specification  is  part  of  the  definition  of  the  QUIST  virtual  relation. 

Suppose, however, that the virtual relation is defined in such a way that X and Y are only allowed
to be linked via intermediate relation Y_1. This means that the relational counterpart of the QUIST
query involves relation Y_1 as a necessary joining “bridge” between X and Y. The relational query
now involves a term that represents this mapping between X and Y:

M(x,y) = ∃y_1 | J(x,y_1) ∧ J(y_1,y).

That is, for every qualifying pair of tuples (x,y) from X and Y there must be some tuple y_1 in Y_1 that
supports the mapping by way of the two prespecified joins.

In general, some sequence of relations Y_1, Y_2, ..., Y_n intervenes between X and Y in the predefined
mappings of the virtual relation, where we adopt the convention that the lower the subscript, the
closer the relation is to X. The mapping expression is then given by:

M(x,y) = ∃y_1 ... ∃y_n | J(x,y_1) ∧ J(y_1,y_2) ∧ ... ∧ J(y_n,y)

where the intermediate conjuncts correspond to the prespecified allowable joins.
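The general mapping expression is a chain of join terms over consecutive variables, prefixed by one existential quantifier per intermediate variable. A minimal sketch, with `exists` and `and` written out in ASCII and the function name `mapping_expression` hypothetical:

```python
def mapping_expression(x, bridges, y):
    """Render M(x, y) for tuple variables x and y.

    bridges: the intermediate tuple variables y_1 ... y_n, nearest to x first.
    """
    chain = [x, *bridges, y]
    # One join term per adjacent pair along the path.
    joins = " and ".join(f"J({a},{b})" for a, b in zip(chain, chain[1:]))
    if not bridges:
        return joins  # neighbors: M(x, y) is just J(x, y)
    quantifiers = " ".join(f"exists {v}" for v in bridges)
    return f"{quantifiers} | {joins}"


direct = mapping_expression("x", [], "y")
bridged = mapping_expression("x", ["y1", "y2"], "y")
```

With no intermediate relations the expression collapses to the single join term of the neighbor case.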

Even  though  relations  X  and  Y  are  involved  in  the  constraints  or  output  designated  by  the  QUIST 
query, it may be that some or all the relations Y_i that connect X and Y are not involved in that way.
However, these relations must be specified in the relational query counterpart to the QUIST query.
They  constitute  the  Case  4  relations  defined  above. 

The idea of this next step of the algorithm is to link up the relations in S_all one at a time by linking
each one to the root relation X, introducing new “bridge” relations as needed. Every time a new
relation is selected for linking or is introduced as a bridge relation, a new tuple variable is generated.

Some complications may arise in the linking process. For one thing, some or all of the necessary
mapping expression may have been generated when a previously chosen relation was linked up.
Intuitively, this occurs when the new relation lies farther out along the same branch from X as the
preceding relation, or when the new and preceding relations have a common ancestor relation
between them and X. Therefore, each linking step only introduces as much of the mapping
expression as necessary. One other possible complication is that the new relation may have been
introduced already as part of the bridge between X and a preceding relation. In this case, a new tuple
variable should not be generated for it.

We now resume the description of the third step in the algorithm. Let Y be the new relation in
S_all that is to be linked up. Assume for the moment that Y has not been previously introduced as a
bridge relation. Therefore, we generate a new tuple variable y to range over Y. If Y is a Case 1
relation, constrained but not part of the output, then variable y is placed in the set Q_∃, as it will
appear existentially bound in the relational query. Otherwise, y is placed in Q_free because it will be in
the query’s target list.

We must now introduce some part of the mapping expression M(x,y). If we denote X by Y_0, then
the definition of the virtual relation specifies that there are n relations Y_1 through Y_n, n ≥ 0, between
X and Y. Let Y_k, 0 ≤ k ≤ n, be the relation with the highest subscript among these relations that have
already been linked into the query; that is, Y_0 through Y_k have already been linked in. Therefore, it
is only necessary to complete the link from Y_k to Y.

In other words, instead of generating the mapping expression M(x,y), we generate the mapping
expression M(y_k,y). In addition, we generate tuple variables y_{k+1} through y_n and place these in the
set Q_∃ because they enter the relational query as existentially bound variables. Of course if n = 0,
meaning that X can be joined directly to Y, or if k = n, meaning that all intervening relations have
already been linked in, then no new tuple variables are generated.

Finally, if Y is in S_c, that is, if it is constrained in the QUIST query, then we generate a restriction
term P(y) in the same way that we generated a restriction term P(x) for relation X in step 2.

Now suppose that relation Y had been introduced previously as a bridging relation. We do not
need to link Y to X because every bridging relation is automatically linked to X. We do not need to
introduce a new tuple variable y, because this has already been done. However, we do have to check
whether Y is in S_out, the set of relations involved in the output. If so, we must transfer y from the set
Q_∃, where it was originally placed as a bridging variable, to the set Q_free so it will end up in the target
list. Also, we must generate a restriction term P(y) if Y is in S_c, the set of constrained relations.

This concludes the description of step 3 of the algorithm. When this has been done for all relations
in S_all, we are ready for the last step, the actual generation of the relational query.
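Steps 2 and 3 can be sketched together as follows. This is an illustration under several assumptions rather than the QUIST implementation: the virtual relation is given as an adjacency map of a tree, relation names stand in for their tuple variables, restriction terms are left out, and the names `link_relations`, `toward_root`, and the relations A to E are hypothetical. Walking back from the new relation until an already linked relation is met finds Y_k without computing the whole mapping M(x,y).

```python
from collections import deque


def link_relations(neighbors, s_c, s_out, root):
    """Link every relation in s_c | s_out to `root`, adding bridges as needed.

    Returns (q_free, q_bound, joins): the free variables, the existentially
    bound variables, and the list of join terms as (relation, relation) pairs.
    """
    # For each relation, record the next relation on its unique path to root.
    toward_root = {root: None}
    queue = deque([root])
    while queue:
        rel = queue.popleft()
        for nxt in neighbors[rel]:
            if nxt not in toward_root:
                toward_root[nxt] = rel
                queue.append(nxt)

    q_free, q_bound, joins = set(), set(), []
    linked = {root}
    (q_free if root in s_out else q_bound).add(root)

    for y in sorted((s_c | s_out) - {root}):
        if y in linked:
            # Y arrived earlier as a bridge: promote it if it is an output.
            if y in s_out:
                q_bound.discard(y)
                q_free.add(y)
            continue
        # Walk back from Y until an already linked relation (Y_k) is reached.
        chain = [y]
        while chain[-1] not in linked:
            chain.append(toward_root[chain[-1]])
        chain.reverse()  # Y_k, Y_{k+1}, ..., Y_n, Y
        joins.extend(zip(chain, chain[1:]))
        for bridge in chain[1:-1]:
            linked.add(bridge)
            q_bound.add(bridge)  # bridges start out existentially bound
        linked.add(y)
        (q_free if y in s_out else q_bound).add(y)
    return q_free, q_bound, joins


# Hypothetical virtual relation: A-B, A-E, B-C, B-D.
neighbors = {"A": ["B", "E"], "B": ["A", "C", "D"], "C": ["B"], "D": ["B"], "E": ["A"]}
q_free, q_bound, joins = link_relations(neighbors, s_c={"C"}, s_out={"D"}, root="C")
```

In the usage lines, relation B is neither constrained nor output, so it enters only as a bridge and stays existentially bound.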

The target list of the query is simply the list of variables in Q_free. The qualification of the query is a
conjunction of restriction terms P(y) for all relations Y in S_c, and join terms J(y,y') generated in the
linking step. We can distinguish the P and J terms according to whether they contain any variables in Q_∃.
Call the conjunctions of these terms P(Q_∃), P(Q_free), J(Q_∃), and J(Q_free). (Note that J(Q_∃) can have
terms that refer to both a bound and a free variable.) We generate existential quantifiers for all
variables in Q_∃. They will be in the form

∃v_1 ∃v_2 ... ∃v_k

if v_1 through v_k are the variables in Q_∃. Let us abbreviate this as ∃(Q_∃). Then the relational query
can be written as:

{(Q_free) | P(Q_free) ∧ J(Q_free) ∧ ∃(Q_∃) (P(Q_∃) ∧ J(Q_∃))}

It  really  doesn’t  matter  if  we  place  the  free  variables  within  the  quantifiers,  so  we  can  equivalently 
express  the  query  as 

{(Q_free) | ∃(Q_∃) (P(Q_all) ∧ J(Q_all))},

where  of  course  the  subscript  “all”  refers  to  all  variables,  free  or  bound.  This  emphasizes  the 
similarity  of  the  relational  form  to  the  original  QUIST  form:  a  simple  conjunction  of  terms. 
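The second, simpler form of the query is easy to assemble once the variable sets and terms are known. A minimal sketch, assuming the terms are already rendered as strings and writing `exists` and `and` in ASCII; the function name `relational_query` and the sample terms are hypothetical.

```python
def relational_query(q_free, q_bound, restriction_terms, join_terms):
    """Assemble {(free vars) | exists bound vars (P terms and J terms)}."""
    target = ", ".join(q_free)
    # All restriction and join terms form one flat conjunction.
    body = " and ".join([*restriction_terms, *join_terms])
    quantifiers = " ".join(f"exists {v}" for v in q_bound)
    if quantifiers:
        body = f"{quantifiers} ({body})"
    return "{(" + target + ") | " + body + "}"


query = relational_query(
    q_free=["d"],
    q_bound=["c", "b"],
    restriction_terms=['(c.Kind = "oil")'],
    join_terms=["J(c,b)", "J(b,d)"],
)
```

Placing every term inside the quantifiers mirrors the observation that the free variables may sit within their scope without changing the answer.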

B.3  QUIST  semantic  rules  and  their  relational  counterparts 

We  now  have  a  quite  simple  representation  of  a  QUIST  query  in  terms  of  the  relational  calculus. 
Next,  we  consider  QUIST  rules  and  transformations  in  terms  of  their  relational  counterparts. 

For  productions,  we  consider  all  the  relations  associated  with  the  attributes  involved  in  the  rule.  If 
we  group  the  constraints  by  relation,  we  start  with  an  expression  like 

∀x, y_1, ..., y_n  P(x) ∧ P_1(y_1) ∧ ... ∧ P_n(y_n) → P(x)

but to this we must add the appropriate join terms to insure that the relations are properly linked.
We select X as the root relation. The process of generating the join terms is then very much like the
one previously described to build up a query. In particular, we may introduce additional variables
that will appear existentially bound in the rule. Let R_f refer to variables that are present without
being introduced for linkage purposes; R_∃ refer just to those existentially bound variables; and R_all
refer to all variables of either kind. Then, using the same kind of abbreviations as in our description
of queries, the relational form of the production is

∀R_f (∃(R_∃) J(R_all) ∧ P_f(R_f)) → P(x).

A bounding rule obviously involves either one relation X or two relations X and Y. The case of one
relation is quite simply ∀x P(x), where P(x) is a comparison between two attributes of relation X. For
two relations, the proper notion is ∀x,y M(x,y) → P(x,y). That is, given the proper mapping
conditions between X and Y, possibly involving intervening relations (hence other existentially
bound variables), then some predicate P(x,y), the comparison between attributes of X and Y, holds.
The form of the bounding rule turns out to be very similar to the form of the production. Recalling
that M(x,y) may involve some existentially bound variables, we can write the bounding rule as

∀x,y (∃(R_∃) J(R_all)) → P(x,y).


B.4  QUIST  transformations  and  their  relational  counterparts 

QUIST  transformations  can  now  be  seen  to  be  a  special  case  of  the  semantic  equivalence 
transformations  for  relational  queries  described  in  Chapter  3.  We  illustrate  this  just  for  the  case  of  a 
production. We start with some query in the form:

{(Q_free) | ∃(Q_∃) (P(Q_all) ∧ J(Q_all))}.

Suppose we wish to infer a constraint on the relation X using a production

∀R_f (∃(R_∃) J(R_all) ∧ P_f(R_f)) → P(x).

The  QUIST  transformation  corresponds  to  dropping  all  the  universal  quantifiers  from  this 
expression,  leaving  one  in  which  Rf  are  free  variables.  In  order  to  carry  out  the  transformation,  two 
conditions  must  be  met.  First,  every  relation  ranged  over  by  the  variables  in  Rf  must  be  a  query 
relation ranged over by some variable in Q_all. This condition also guarantees that the query has the
requisite join terms. Second, every restriction term in the conjunction P_f(R_f) must be at least as
strong as the corresponding restriction term in the conjunction P(Q_all). If these two conditions are
met,  then  upon  application  of  the  logical  schema 

(A ∧ (A → B)) ≡ (A ∧ B)

the constraint P(x) can be conjoined to the query expression, where x' is the variable in the query that
corresponds  to  x  in  the  rule. 
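The logical schema that licenses the transformation, that A together with "A implies B" is equivalent to A together with B, can be checked exhaustively over the four truth assignments. The short sketch below does only that; the names `implies` and `schema_holds` are hypothetical.

```python
from itertools import product


def implies(a, b):
    """Material implication: a -> b."""
    return (not a) or b


def schema_holds():
    """Check (A and (A -> B)) == (A and B) for every truth assignment."""
    return all(
        (a and implies(a, b)) == (a and b)
        for a, b in product([False, True], repeat=2)
    )


schema_ok = schema_holds()
```

Here A plays the role of the query's qualification, once it is known to satisfy the antecedent of the rule, and B the role of the inferred constraint.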

There  is  one  additional  case  where  the  transformation  can  be  made,  that  of  join  introduction.  In 
that  case,  the  relation  X  is  not  already  part  of  the  query  even  though  all  the  antecedent  conditions  of 
the rule are met. We wish to add P(x) to the query. The only way to do so is to add in the necessary
join  terms  to  link  X  to  the  existing  query  relations.  That  is,  we  want  to  link  X  into  the  query  in  just 
the  same  way  that  we  described  above  for  constructing  a  query  step  by  step.  Obviously,  x  itself  is  not 
already  in  the  target  list,  so  x  will  be  existentially  bound  in  the  query.  Also,  any  intermediate 
relations  needed  to  link  in  x  will  be  existentially  bound.  Hence,  we  seek  to  introduce  a  conjunction  of 
join  terms  that  involve  existentially  bound  variables,  such  as 

∃y_1 ... ∃y_n ∃x  J(y,y_1) ∧ ... ∧ J(y_n,x)

where  y  is  the  variable  that  ranges  over  relation  Y,  the  relation  already  in  the  query  to  which  X  can  be 
linked.  This  expression  can  be  conjoined  to  the  original  query  without  altering  the  answer  if  and  only 
if  every  tuple  in  Y  satisfies  it;  that  is,  if  and  only  if  the  structural  integrity  constraint 